Name: Ananya Kotla¶

Stanford Pre-Collegiate Summer Institutes (Intro to Machine Learning) ¶

Project Title: Breast Cancer Diagnostic Prediction¶

Dataset Link: https://www.kaggle.com/datasets/uciml/breast-cancer-wisconsin-data?resource=download¶

Description of Dataset:¶

The Breast Cancer Wisconsin (Diagnostic) Data Set is used to predict whether a tumor is malignant (cancerous) or benign (non-cancerous) based on features that describe characteristics of the cell nuclei. These features, included in the data set, are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass.

Number of Instances: 569

Variables:

  • id (ID): ID number
  • diagnosis (TARGET): The diagnosis of breast tissues (M = malignant, B = benign)
  • radius_mean: The mean of distances from center to points on the perimeter
  • texture_mean: The standard deviation of gray-scale values
  • perimeter_mean: The mean size of the core tumor
  • area_mean: The average area of the nucleus
  • smoothness_mean: The mean of local variation in radius lengths
  • compactness_mean: The mean of perimeter^2 / area - 1.0 of the nucleus shape
  • concavity_mean: The mean of severity of concave portions of the contour
  • concave points_mean: The mean for number of concave portions of the contour
  • symmetry_mean: The mean for symmetry of the nucleus shape
  • fractal_dimension_mean: The mean for "coastline approximation" - 1
  • radius_se: Standard error of radius_mean
  • texture_se: Standard error of the texture_mean
  • perimeter_se: Standard error of the perimeter_mean
  • area_se: Standard error of the area_mean
  • smoothness_se: Standard error of the smoothness_mean
  • compactness_se: Standard error of the compactness_mean
  • concavity_se: Standard error of the concavity_mean
  • concave points_se: Standard error of the concave points_mean
  • symmetry_se: Standard error of the symmetry_mean
  • fractal_dimension_se: Standard error of the fractal_dimension_mean
  • radius_worst: "Worst" or largest mean value for the radius_mean
  • texture_worst: "Worst" or largest mean value for the texture_mean
  • perimeter_worst: "Worst" or largest mean value for the perimeter_mean
  • area_worst: "Worst" or largest mean value for the area_mean
  • smoothness_worst: "Worst" or largest mean value for the smoothness_mean
  • compactness_worst: "Worst" or largest mean value for the compactness_mean
  • concavity_worst: "Worst" or largest mean value for the concavity_mean
  • concave points_worst: "Worst" or largest mean value for the concave points_mean
  • symmetry_worst: "Worst" or largest mean value for the symmetry_mean
  • fractal_dimension_worst: "Worst" or largest mean value for the fractal_dimension_mean
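The compactness feature listed above is a dimensionless shape measure. As a minimal sketch (synthetic values, not taken from the dataset), perimeter² / area − 1.0 reaches its floor for a circle and grows as the shape becomes more irregular:

```python
import math

def compactness(perimeter, area):
    # perimeter^2 / area - 1.0, matching the compactness_mean description
    return perimeter ** 2 / area - 1.0

# A circle minimizes perimeter for a given area, so its compactness is
# the theoretical minimum: (2*pi*r)^2 / (pi*r^2) - 1 = 4*pi - 1
r = 5.0
circle = compactness(2 * math.pi * r, math.pi * r ** 2)

# A 2x10 rectangle has more perimeter relative to its area,
# hence a larger compactness (a "less round" nucleus)
rect = compactness(2 * (2 + 10), 2 * 10)

print(round(circle, 3))  # 11.566 (= 4*pi - 1)
print(round(rect, 3))    # 27.8
```

This is why higher compactness values tend to indicate more irregular nuclei.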
In [1]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline 
import scipy.stats as stats
from scipy.stats import skew, norm, probplot
from matplotlib.pyplot import boxplot
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")
In [2]:
# Import the dataset
df = pd.read_csv("data.csv")
In [3]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 33 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   id                       569 non-null    int64  
 1   diagnosis                569 non-null    object 
 2   radius_mean              569 non-null    float64
 3   texture_mean             569 non-null    float64
 4   perimeter_mean           569 non-null    float64
 5   area_mean                569 non-null    float64
 6   smoothness_mean          569 non-null    float64
 7   compactness_mean         569 non-null    float64
 8   concavity_mean           569 non-null    float64
 9   concave points_mean      569 non-null    float64
 10  symmetry_mean            569 non-null    float64
 11  fractal_dimension_mean   569 non-null    float64
 12  radius_se                569 non-null    float64
 13  texture_se               569 non-null    float64
 14  perimeter_se             569 non-null    float64
 15  area_se                  569 non-null    float64
 16  smoothness_se            569 non-null    float64
 17  compactness_se           569 non-null    float64
 18  concavity_se             569 non-null    float64
 19  concave points_se        569 non-null    float64
 20  symmetry_se              569 non-null    float64
 21  fractal_dimension_se     569 non-null    float64
 22  radius_worst             569 non-null    float64
 23  texture_worst            569 non-null    float64
 24  perimeter_worst          569 non-null    float64
 25  area_worst               569 non-null    float64
 26  smoothness_worst         569 non-null    float64
 27  compactness_worst        569 non-null    float64
 28  concavity_worst          569 non-null    float64
 29  concave points_worst     569 non-null    float64
 30  symmetry_worst           569 non-null    float64
 31  fractal_dimension_worst  569 non-null    float64
 32  Unnamed: 32              0 non-null      float64
dtypes: float64(31), int64(1), object(1)
memory usage: 146.8+ KB

Data Cleaning¶

In [4]:
# Cleaning Data
# Drop two columns: the empty 'Unnamed: 32' column and the 'id' column, since neither affects the diagnosis variable
del df['Unnamed: 32']
del df['id']
In [5]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 31 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   diagnosis                569 non-null    object 
 1   radius_mean              569 non-null    float64
 2   texture_mean             569 non-null    float64
 3   perimeter_mean           569 non-null    float64
 4   area_mean                569 non-null    float64
 5   smoothness_mean          569 non-null    float64
 6   compactness_mean         569 non-null    float64
 7   concavity_mean           569 non-null    float64
 8   concave points_mean      569 non-null    float64
 9   symmetry_mean            569 non-null    float64
 10  fractal_dimension_mean   569 non-null    float64
 11  radius_se                569 non-null    float64
 12  texture_se               569 non-null    float64
 13  perimeter_se             569 non-null    float64
 14  area_se                  569 non-null    float64
 15  smoothness_se            569 non-null    float64
 16  compactness_se           569 non-null    float64
 17  concavity_se             569 non-null    float64
 18  concave points_se        569 non-null    float64
 19  symmetry_se              569 non-null    float64
 20  fractal_dimension_se     569 non-null    float64
 21  radius_worst             569 non-null    float64
 22  texture_worst            569 non-null    float64
 23  perimeter_worst          569 non-null    float64
 24  area_worst               569 non-null    float64
 25  smoothness_worst         569 non-null    float64
 26  compactness_worst        569 non-null    float64
 27  concavity_worst          569 non-null    float64
 28  concave points_worst     569 non-null    float64
 29  symmetry_worst           569 non-null    float64
 30  fractal_dimension_worst  569 non-null    float64
dtypes: float64(30), object(1)
memory usage: 137.9+ KB
In [6]:
# Convert the 'diagnosis' (TARGET) column: M → 1, B → 0
# Numeric labels are easier for models to work with
df['diagnosis'] = df['diagnosis'].map({'M': 1, 'B': 0})
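One caveat with `.map` worth noting: any label outside {'M', 'B'} becomes NaN silently. A small sketch (synthetic labels) that performs the conversion and verifies nothing was lost:

```python
import pandas as pd

s = pd.Series(["M", "B", "B", "M"])
encoded = s.map({"M": 1, "B": 0})

# Unmapped labels would surface here as NaN, so an explicit check
# catches typos like "m" or stray whitespace before modeling
assert encoded.notna().all()
print(encoded.tolist())  # [1, 0, 0, 1]
```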
In [7]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 31 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   diagnosis                569 non-null    int64  
 1   radius_mean              569 non-null    float64
 2   texture_mean             569 non-null    float64
 3   perimeter_mean           569 non-null    float64
 4   area_mean                569 non-null    float64
 5   smoothness_mean          569 non-null    float64
 6   compactness_mean         569 non-null    float64
 7   concavity_mean           569 non-null    float64
 8   concave points_mean      569 non-null    float64
 9   symmetry_mean            569 non-null    float64
 10  fractal_dimension_mean   569 non-null    float64
 11  radius_se                569 non-null    float64
 12  texture_se               569 non-null    float64
 13  perimeter_se             569 non-null    float64
 14  area_se                  569 non-null    float64
 15  smoothness_se            569 non-null    float64
 16  compactness_se           569 non-null    float64
 17  concavity_se             569 non-null    float64
 18  concave points_se        569 non-null    float64
 19  symmetry_se              569 non-null    float64
 20  fractal_dimension_se     569 non-null    float64
 21  radius_worst             569 non-null    float64
 22  texture_worst            569 non-null    float64
 23  perimeter_worst          569 non-null    float64
 24  area_worst               569 non-null    float64
 25  smoothness_worst         569 non-null    float64
 26  compactness_worst        569 non-null    float64
 27  concavity_worst          569 non-null    float64
 28  concave points_worst     569 non-null    float64
 29  symmetry_worst           569 non-null    float64
 30  fractal_dimension_worst  569 non-null    float64
dtypes: float64(30), int64(1)
memory usage: 137.9 KB
In [8]:
df.head()
Out[8]:
diagnosis radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean concavity_mean concave points_mean symmetry_mean ... radius_worst texture_worst perimeter_worst area_worst smoothness_worst compactness_worst concavity_worst concave points_worst symmetry_worst fractal_dimension_worst
0 1 17.99 10.38 122.80 1001.0 0.11840 0.27760 0.3001 0.14710 0.2419 ... 25.38 17.33 184.60 2019.0 0.1622 0.6656 0.7119 0.2654 0.4601 0.11890
1 1 20.57 17.77 132.90 1326.0 0.08474 0.07864 0.0869 0.07017 0.1812 ... 24.99 23.41 158.80 1956.0 0.1238 0.1866 0.2416 0.1860 0.2750 0.08902
2 1 19.69 21.25 130.00 1203.0 0.10960 0.15990 0.1974 0.12790 0.2069 ... 23.57 25.53 152.50 1709.0 0.1444 0.4245 0.4504 0.2430 0.3613 0.08758
3 1 11.42 20.38 77.58 386.1 0.14250 0.28390 0.2414 0.10520 0.2597 ... 14.91 26.50 98.87 567.7 0.2098 0.8663 0.6869 0.2575 0.6638 0.17300
4 1 20.29 14.34 135.10 1297.0 0.10030 0.13280 0.1980 0.10430 0.1809 ... 22.54 16.67 152.20 1575.0 0.1374 0.2050 0.4000 0.1625 0.2364 0.07678

5 rows × 31 columns

In [9]:
df.tail()
Out[9]:
diagnosis radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean concavity_mean concave points_mean symmetry_mean ... radius_worst texture_worst perimeter_worst area_worst smoothness_worst compactness_worst concavity_worst concave points_worst symmetry_worst fractal_dimension_worst
564 1 21.56 22.39 142.00 1479.0 0.11100 0.11590 0.24390 0.13890 0.1726 ... 25.450 26.40 166.10 2027.0 0.14100 0.21130 0.4107 0.2216 0.2060 0.07115
565 1 20.13 28.25 131.20 1261.0 0.09780 0.10340 0.14400 0.09791 0.1752 ... 23.690 38.25 155.00 1731.0 0.11660 0.19220 0.3215 0.1628 0.2572 0.06637
566 1 16.60 28.08 108.30 858.1 0.08455 0.10230 0.09251 0.05302 0.1590 ... 18.980 34.12 126.70 1124.0 0.11390 0.30940 0.3403 0.1418 0.2218 0.07820
567 1 20.60 29.33 140.10 1265.0 0.11780 0.27700 0.35140 0.15200 0.2397 ... 25.740 39.42 184.60 1821.0 0.16500 0.86810 0.9387 0.2650 0.4087 0.12400
568 0 7.76 24.54 47.92 181.0 0.05263 0.04362 0.00000 0.00000 0.1587 ... 9.456 30.37 59.16 268.6 0.08996 0.06444 0.0000 0.0000 0.2871 0.07039

5 rows × 31 columns

Descriptive Statistics¶

In [10]:
# Calculated to gain insights into the data distribution
df.mean()
Out[10]:
diagnosis                    0.372583
radius_mean                 14.127292
texture_mean                19.289649
perimeter_mean              91.969033
area_mean                  654.889104
smoothness_mean              0.096360
compactness_mean             0.104341
concavity_mean               0.088799
concave points_mean          0.048919
symmetry_mean                0.181162
fractal_dimension_mean       0.062798
radius_se                    0.405172
texture_se                   1.216853
perimeter_se                 2.866059
area_se                     40.337079
smoothness_se                0.007041
compactness_se               0.025478
concavity_se                 0.031894
concave points_se            0.011796
symmetry_se                  0.020542
fractal_dimension_se         0.003795
radius_worst                16.269190
texture_worst               25.677223
perimeter_worst            107.261213
area_worst                 880.583128
smoothness_worst             0.132369
compactness_worst            0.254265
concavity_worst              0.272188
concave points_worst         0.114606
symmetry_worst               0.290076
fractal_dimension_worst      0.083946
dtype: float64
In [11]:
df.median()
Out[11]:
diagnosis                    0.000000
radius_mean                 13.370000
texture_mean                18.840000
perimeter_mean              86.240000
area_mean                  551.100000
smoothness_mean              0.095870
compactness_mean             0.092630
concavity_mean               0.061540
concave points_mean          0.033500
symmetry_mean                0.179200
fractal_dimension_mean       0.061540
radius_se                    0.324200
texture_se                   1.108000
perimeter_se                 2.287000
area_se                     24.530000
smoothness_se                0.006380
compactness_se               0.020450
concavity_se                 0.025890
concave points_se            0.010930
symmetry_se                  0.018730
fractal_dimension_se         0.003187
radius_worst                14.970000
texture_worst               25.410000
perimeter_worst             97.660000
area_worst                 686.500000
smoothness_worst             0.131300
compactness_worst            0.211900
concavity_worst              0.226700
concave points_worst         0.099930
symmetry_worst               0.282200
fractal_dimension_worst      0.080040
dtype: float64
In [12]:
df.mode()
Out[12]:
diagnosis radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean concavity_mean concave points_mean symmetry_mean ... radius_worst texture_worst perimeter_worst area_worst smoothness_worst compactness_worst concavity_worst concave points_worst symmetry_worst fractal_dimension_worst
0 0.0 12.34 14.93 82.61 512.2 0.1007 0.1147 0.0 0.0 0.1601 ... 12.36 17.70 101.7 284.4 0.1216 0.1486 0.0 0.0 0.2226 0.07427
1 NaN NaN 15.70 87.76 NaN NaN 0.1206 NaN NaN 0.1714 ... NaN 27.26 105.9 402.8 0.1223 0.3416 NaN NaN 0.2369 NaN
2 NaN NaN 16.84 134.70 NaN NaN NaN NaN NaN 0.1717 ... NaN NaN 117.7 439.6 0.1234 NaN NaN NaN 0.2383 NaN
3 NaN NaN 16.85 NaN NaN NaN NaN NaN NaN 0.1769 ... NaN NaN NaN 458.0 0.1256 NaN NaN NaN 0.2972 NaN
4 NaN NaN 17.46 NaN NaN NaN NaN NaN NaN 0.1893 ... NaN NaN NaN 472.4 0.1275 NaN NaN NaN 0.3109 NaN
5 NaN NaN 18.22 NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN 489.5 0.1312 NaN NaN NaN 0.3196 NaN
6 NaN NaN 18.90 NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN 546.7 0.1347 NaN NaN NaN NaN NaN
7 NaN NaN 19.83 NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN 547.4 0.1401 NaN NaN NaN NaN NaN
8 NaN NaN 20.52 NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN 624.1 0.1415 NaN NaN NaN NaN NaN
9 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN 698.8 NaN NaN NaN NaN NaN NaN
10 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN 706.0 NaN NaN NaN NaN NaN NaN
11 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN 708.8 NaN NaN NaN NaN NaN NaN
12 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN 725.9 NaN NaN NaN NaN NaN NaN
13 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN 733.5 NaN NaN NaN NaN NaN NaN
14 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN 749.9 NaN NaN NaN NaN NaN NaN
15 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN 808.9 NaN NaN NaN NaN NaN NaN
16 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN 826.4 NaN NaN NaN NaN NaN NaN
17 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN 830.5 NaN NaN NaN NaN NaN NaN
18 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN 1210.0 NaN NaN NaN NaN NaN NaN
19 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN 1261.0 NaN NaN NaN NaN NaN NaN
20 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN 1269.0 NaN NaN NaN NaN NaN NaN
21 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN 1437.0 NaN NaN NaN NaN NaN NaN
22 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN 1603.0 NaN NaN NaN NaN NaN NaN
23 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN 1623.0 NaN NaN NaN NaN NaN NaN
24 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN 1750.0 NaN NaN NaN NaN NaN NaN

25 rows × 31 columns

In [13]:
df.std()
Out[13]:
diagnosis                    0.483918
radius_mean                  3.524049
texture_mean                 4.301036
perimeter_mean              24.298981
area_mean                  351.914129
smoothness_mean              0.014064
compactness_mean             0.052813
concavity_mean               0.079720
concave points_mean          0.038803
symmetry_mean                0.027414
fractal_dimension_mean       0.007060
radius_se                    0.277313
texture_se                   0.551648
perimeter_se                 2.021855
area_se                     45.491006
smoothness_se                0.003003
compactness_se               0.017908
concavity_se                 0.030186
concave points_se            0.006170
symmetry_se                  0.008266
fractal_dimension_se         0.002646
radius_worst                 4.833242
texture_worst                6.146258
perimeter_worst             33.602542
area_worst                 569.356993
smoothness_worst             0.022832
compactness_worst            0.157336
concavity_worst              0.208624
concave points_worst         0.065732
symmetry_worst               0.061867
fractal_dimension_worst      0.018061
dtype: float64
In [14]:
df.std()*3
Out[14]:
diagnosis                     1.451754
radius_mean                  10.572146
texture_mean                 12.903107
perimeter_mean               72.896943
area_mean                  1055.742388
smoothness_mean               0.042192
compactness_mean              0.158438
concavity_mean                0.239159
concave points_mean           0.116409
symmetry_mean                 0.082243
fractal_dimension_mean        0.021181
radius_se                     0.831938
texture_se                    1.654945
perimeter_se                  6.065564
area_se                     136.473017
smoothness_se                 0.009008
compactness_se                0.053725
concavity_se                  0.090558
concave points_se             0.018511
symmetry_se                   0.024799
fractal_dimension_se          0.007938
radius_worst                 14.499725
texture_worst                18.438773
perimeter_worst             100.807627
area_worst                 1708.070978
smoothness_worst              0.068497
compactness_worst             0.472009
concavity_worst               0.625873
concave points_worst          0.197197
symmetry_worst                0.185602
fractal_dimension_worst       0.054184
dtype: float64
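The `3 * std` values above are typically paired with the three-sigma rule: points farther than three standard deviations from a column's mean are flagged as potential outliers. A minimal sketch on synthetic data (the column is a stand-in for `area_se`, whose max of 542.2 sits far above its mean of about 40 in the real data):

```python
import pandas as pd

# 20 typical values plus one extreme
s = pd.Series([25.0] * 20 + [542.2], name="area_se")

mean, std = s.mean(), s.std()
# Flag values more than 3 standard deviations from the mean
flagged = s[(s - mean).abs() > 3 * std]
print(flagged.tolist())  # [542.2]
```

Note that with very few observations a single extreme value inflates the standard deviation itself, which weakens this rule; it works best on columns with many typical points.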
In [15]:
df.columns
Out[15]:
Index(['diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean',
       'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
       'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
       'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'smoothness_worst',
       'compactness_worst', 'concavity_worst', 'concave points_worst',
       'symmetry_worst', 'fractal_dimension_worst'],
      dtype='object')
In [16]:
# Number of rows, columns
df.shape
Out[16]:
(569, 31)
In [17]:
df.describe()
Out[17]:
diagnosis radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean concavity_mean concave points_mean symmetry_mean ... radius_worst texture_worst perimeter_worst area_worst smoothness_worst compactness_worst concavity_worst concave points_worst symmetry_worst fractal_dimension_worst
count 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 ... 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000
mean 0.372583 14.127292 19.289649 91.969033 654.889104 0.096360 0.104341 0.088799 0.048919 0.181162 ... 16.269190 25.677223 107.261213 880.583128 0.132369 0.254265 0.272188 0.114606 0.290076 0.083946
std 0.483918 3.524049 4.301036 24.298981 351.914129 0.014064 0.052813 0.079720 0.038803 0.027414 ... 4.833242 6.146258 33.602542 569.356993 0.022832 0.157336 0.208624 0.065732 0.061867 0.018061
min 0.000000 6.981000 9.710000 43.790000 143.500000 0.052630 0.019380 0.000000 0.000000 0.106000 ... 7.930000 12.020000 50.410000 185.200000 0.071170 0.027290 0.000000 0.000000 0.156500 0.055040
25% 0.000000 11.700000 16.170000 75.170000 420.300000 0.086370 0.064920 0.029560 0.020310 0.161900 ... 13.010000 21.080000 84.110000 515.300000 0.116600 0.147200 0.114500 0.064930 0.250400 0.071460
50% 0.000000 13.370000 18.840000 86.240000 551.100000 0.095870 0.092630 0.061540 0.033500 0.179200 ... 14.970000 25.410000 97.660000 686.500000 0.131300 0.211900 0.226700 0.099930 0.282200 0.080040
75% 1.000000 15.780000 21.800000 104.100000 782.700000 0.105300 0.130400 0.130700 0.074000 0.195700 ... 18.790000 29.720000 125.400000 1084.000000 0.146000 0.339100 0.382900 0.161400 0.317900 0.092080
max 1.000000 28.110000 39.280000 188.500000 2501.000000 0.163400 0.345400 0.426800 0.201200 0.304000 ... 36.040000 49.540000 251.200000 4254.000000 0.222600 1.058000 1.252000 0.291000 0.663800 0.207500

8 rows × 31 columns

In [18]:
df.describe().T
Out[18]:
count mean std min 25% 50% 75% max
diagnosis 569.0 0.372583 0.483918 0.000000 0.000000 0.000000 1.000000 1.00000
radius_mean 569.0 14.127292 3.524049 6.981000 11.700000 13.370000 15.780000 28.11000
texture_mean 569.0 19.289649 4.301036 9.710000 16.170000 18.840000 21.800000 39.28000
perimeter_mean 569.0 91.969033 24.298981 43.790000 75.170000 86.240000 104.100000 188.50000
area_mean 569.0 654.889104 351.914129 143.500000 420.300000 551.100000 782.700000 2501.00000
smoothness_mean 569.0 0.096360 0.014064 0.052630 0.086370 0.095870 0.105300 0.16340
compactness_mean 569.0 0.104341 0.052813 0.019380 0.064920 0.092630 0.130400 0.34540
concavity_mean 569.0 0.088799 0.079720 0.000000 0.029560 0.061540 0.130700 0.42680
concave points_mean 569.0 0.048919 0.038803 0.000000 0.020310 0.033500 0.074000 0.20120
symmetry_mean 569.0 0.181162 0.027414 0.106000 0.161900 0.179200 0.195700 0.30400
fractal_dimension_mean 569.0 0.062798 0.007060 0.049960 0.057700 0.061540 0.066120 0.09744
radius_se 569.0 0.405172 0.277313 0.111500 0.232400 0.324200 0.478900 2.87300
texture_se 569.0 1.216853 0.551648 0.360200 0.833900 1.108000 1.474000 4.88500
perimeter_se 569.0 2.866059 2.021855 0.757000 1.606000 2.287000 3.357000 21.98000
area_se 569.0 40.337079 45.491006 6.802000 17.850000 24.530000 45.190000 542.20000
smoothness_se 569.0 0.007041 0.003003 0.001713 0.005169 0.006380 0.008146 0.03113
compactness_se 569.0 0.025478 0.017908 0.002252 0.013080 0.020450 0.032450 0.13540
concavity_se 569.0 0.031894 0.030186 0.000000 0.015090 0.025890 0.042050 0.39600
concave points_se 569.0 0.011796 0.006170 0.000000 0.007638 0.010930 0.014710 0.05279
symmetry_se 569.0 0.020542 0.008266 0.007882 0.015160 0.018730 0.023480 0.07895
fractal_dimension_se 569.0 0.003795 0.002646 0.000895 0.002248 0.003187 0.004558 0.02984
radius_worst 569.0 16.269190 4.833242 7.930000 13.010000 14.970000 18.790000 36.04000
texture_worst 569.0 25.677223 6.146258 12.020000 21.080000 25.410000 29.720000 49.54000
perimeter_worst 569.0 107.261213 33.602542 50.410000 84.110000 97.660000 125.400000 251.20000
area_worst 569.0 880.583128 569.356993 185.200000 515.300000 686.500000 1084.000000 4254.00000
smoothness_worst 569.0 0.132369 0.022832 0.071170 0.116600 0.131300 0.146000 0.22260
compactness_worst 569.0 0.254265 0.157336 0.027290 0.147200 0.211900 0.339100 1.05800
concavity_worst 569.0 0.272188 0.208624 0.000000 0.114500 0.226700 0.382900 1.25200
concave points_worst 569.0 0.114606 0.065732 0.000000 0.064930 0.099930 0.161400 0.29100
symmetry_worst 569.0 0.290076 0.061867 0.156500 0.250400 0.282200 0.317900 0.66380
fractal_dimension_worst 569.0 0.083946 0.018061 0.055040 0.071460 0.080040 0.092080 0.20750
In [19]:
# Limit displayed values to 2 decimal places
pd.set_option('display.float_format', lambda x: '%.2f' % x)
df.describe().T
Out[19]:
count mean std min 25% 50% 75% max
diagnosis 569.00 0.37 0.48 0.00 0.00 0.00 1.00 1.00
radius_mean 569.00 14.13 3.52 6.98 11.70 13.37 15.78 28.11
texture_mean 569.00 19.29 4.30 9.71 16.17 18.84 21.80 39.28
perimeter_mean 569.00 91.97 24.30 43.79 75.17 86.24 104.10 188.50
area_mean 569.00 654.89 351.91 143.50 420.30 551.10 782.70 2501.00
smoothness_mean 569.00 0.10 0.01 0.05 0.09 0.10 0.11 0.16
compactness_mean 569.00 0.10 0.05 0.02 0.06 0.09 0.13 0.35
concavity_mean 569.00 0.09 0.08 0.00 0.03 0.06 0.13 0.43
concave points_mean 569.00 0.05 0.04 0.00 0.02 0.03 0.07 0.20
symmetry_mean 569.00 0.18 0.03 0.11 0.16 0.18 0.20 0.30
fractal_dimension_mean 569.00 0.06 0.01 0.05 0.06 0.06 0.07 0.10
radius_se 569.00 0.41 0.28 0.11 0.23 0.32 0.48 2.87
texture_se 569.00 1.22 0.55 0.36 0.83 1.11 1.47 4.88
perimeter_se 569.00 2.87 2.02 0.76 1.61 2.29 3.36 21.98
area_se 569.00 40.34 45.49 6.80 17.85 24.53 45.19 542.20
smoothness_se 569.00 0.01 0.00 0.00 0.01 0.01 0.01 0.03
compactness_se 569.00 0.03 0.02 0.00 0.01 0.02 0.03 0.14
concavity_se 569.00 0.03 0.03 0.00 0.02 0.03 0.04 0.40
concave points_se 569.00 0.01 0.01 0.00 0.01 0.01 0.01 0.05
symmetry_se 569.00 0.02 0.01 0.01 0.02 0.02 0.02 0.08
fractal_dimension_se 569.00 0.00 0.00 0.00 0.00 0.00 0.00 0.03
radius_worst 569.00 16.27 4.83 7.93 13.01 14.97 18.79 36.04
texture_worst 569.00 25.68 6.15 12.02 21.08 25.41 29.72 49.54
perimeter_worst 569.00 107.26 33.60 50.41 84.11 97.66 125.40 251.20
area_worst 569.00 880.58 569.36 185.20 515.30 686.50 1084.00 4254.00
smoothness_worst 569.00 0.13 0.02 0.07 0.12 0.13 0.15 0.22
compactness_worst 569.00 0.25 0.16 0.03 0.15 0.21 0.34 1.06
concavity_worst 569.00 0.27 0.21 0.00 0.11 0.23 0.38 1.25
concave points_worst 569.00 0.11 0.07 0.00 0.06 0.10 0.16 0.29
symmetry_worst 569.00 0.29 0.06 0.16 0.25 0.28 0.32 0.66
fractal_dimension_worst 569.00 0.08 0.02 0.06 0.07 0.08 0.09 0.21
In [20]:
df.diagnosis.value_counts()
Out[20]:
diagnosis
0    357
1    212
Name: count, dtype: int64
In [21]:
df["diagnosis"].value_counts()
Out[21]:
diagnosis
0    357
1    212
Name: count, dtype: int64
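The counts above show a moderate class imbalance: 357 benign vs. 212 malignant, roughly a 63/37 split. `value_counts(normalize=True)` reports the same split as proportions; a quick sketch on a synthetic target column with the same counts:

```python
import pandas as pd

diagnosis = pd.Series([0] * 357 + [1] * 212, name="diagnosis")

# normalize=True converts counts to proportions
proportions = diagnosis.value_counts(normalize=True)
print(proportions.round(3).to_dict())  # {0: 0.627, 1: 0.373}
```

This imbalance is worth keeping in mind when evaluating models, since always predicting "benign" would already score about 63% accuracy.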
In [22]:
# Display a random sample of 15 rows from the DataFrame
# Note: np.random.seed() with no argument reseeds from system entropy,
# so the sample changes each run; pass an integer seed for reproducibility
np.random.seed()
df.sample(n=15)
Out[22]:
diagnosis radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean concavity_mean concave points_mean symmetry_mean ... radius_worst texture_worst perimeter_worst area_worst smoothness_worst compactness_worst concavity_worst concave points_worst symmetry_worst fractal_dimension_worst
150 0 13.00 20.78 83.51 519.40 0.11 0.08 0.03 0.03 0.25 ... 14.16 24.11 90.82 616.70 0.13 0.11 0.08 0.06 0.32 0.06
515 0 11.34 18.61 72.76 391.20 0.10 0.08 0.04 0.03 0.19 ... 12.47 23.03 79.15 478.60 0.15 0.16 0.16 0.09 0.31 0.07
470 0 9.67 18.49 61.49 289.10 0.09 0.06 0.03 0.02 0.22 ... 11.14 25.62 70.88 385.20 0.12 0.15 0.13 0.07 0.32 0.09
31 1 11.84 18.70 77.93 440.60 0.11 0.15 0.12 0.05 0.23 ... 16.82 28.12 119.40 888.70 0.16 0.58 0.70 0.15 0.48 0.14
417 1 15.50 21.08 102.90 803.10 0.11 0.16 0.15 0.08 0.21 ... 23.17 27.65 157.10 1748.00 0.15 0.40 0.42 0.21 0.30 0.10
78 1 20.18 23.97 143.70 1245.00 0.13 0.35 0.38 0.16 0.29 ... 23.37 31.72 170.30 1623.00 0.16 0.62 0.77 0.25 0.54 0.10
195 0 12.91 16.33 82.53 516.40 0.08 0.05 0.04 0.02 0.18 ... 13.88 22.00 90.81 600.60 0.11 0.15 0.18 0.08 0.30 0.07
234 0 9.57 15.91 60.21 279.60 0.08 0.04 0.02 0.02 0.16 ... 10.51 19.16 65.74 335.90 0.15 0.10 0.07 0.07 0.28 0.08
506 0 12.22 20.04 79.47 453.10 0.11 0.12 0.08 0.02 0.21 ... 13.16 24.17 85.13 515.30 0.14 0.23 0.35 0.08 0.27 0.09
406 0 16.14 14.86 104.30 800.00 0.09 0.09 0.06 0.05 0.17 ... 17.71 19.58 115.90 947.90 0.12 0.17 0.23 0.11 0.28 0.07
40 1 13.44 21.58 86.18 563.00 0.08 0.06 0.03 0.02 0.18 ... 15.93 30.25 102.50 787.90 0.11 0.20 0.21 0.11 0.30 0.07
77 1 18.05 16.15 120.20 1006.00 0.11 0.21 0.17 0.11 0.22 ... 22.39 18.91 150.10 1610.00 0.15 0.56 0.38 0.21 0.38 0.11
4 1 20.29 14.34 135.10 1297.00 0.10 0.13 0.20 0.10 0.18 ... 22.54 16.67 152.20 1575.00 0.14 0.20 0.40 0.16 0.24 0.08
87 1 19.02 24.59 122.00 1076.00 0.09 0.12 0.15 0.08 0.20 ... 24.56 30.41 152.90 1623.00 0.12 0.32 0.58 0.20 0.40 0.09
360 0 12.54 18.07 79.42 491.90 0.07 0.03 0.00 0.01 0.15 ... 13.72 20.98 86.82 585.70 0.09 0.04 0.00 0.02 0.22 0.06

15 rows × 31 columns

In [23]:
# Checking for duplicated data
df[df.duplicated()].count()
Out[23]:
diagnosis                  0
radius_mean                0
texture_mean               0
perimeter_mean             0
area_mean                  0
smoothness_mean            0
compactness_mean           0
concavity_mean             0
concave points_mean        0
symmetry_mean              0
fractal_dimension_mean     0
radius_se                  0
texture_se                 0
perimeter_se               0
area_se                    0
smoothness_se              0
compactness_se             0
concavity_se               0
concave points_se          0
symmetry_se                0
fractal_dimension_se       0
radius_worst               0
texture_worst              0
perimeter_worst            0
area_worst                 0
smoothness_worst           0
compactness_worst          0
concavity_worst            0
concave points_worst       0
symmetry_worst             0
fractal_dimension_worst    0
dtype: int64
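A more compact way to express the same check: `duplicated().sum()` returns a single count of fully duplicated rows rather than a per-column tally. A sketch on a small synthetic frame:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 2, 3], "b": [4, 5, 5, 6]})

# One row repeats exactly (index 2 duplicates index 1)
n_dupes = df.duplicated().sum()
print(n_dupes)  # 1

# If any duplicates existed, dropping them would be:
df = df.drop_duplicates()
print(len(df))  # 3
```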

Exploratory Data Analysis¶

In [24]:
# Explore relationships between variables using Pearson's correlation
df.corr()
Out[24]:
diagnosis radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean concavity_mean concave points_mean symmetry_mean ... radius_worst texture_worst perimeter_worst area_worst smoothness_worst compactness_worst concavity_worst concave points_worst symmetry_worst fractal_dimension_worst
diagnosis 1.00 0.73 0.42 0.74 0.71 0.36 0.60 0.70 0.78 0.33 ... 0.78 0.46 0.78 0.73 0.42 0.59 0.66 0.79 0.42 0.32
radius_mean 0.73 1.00 0.32 1.00 0.99 0.17 0.51 0.68 0.82 0.15 ... 0.97 0.30 0.97 0.94 0.12 0.41 0.53 0.74 0.16 0.01
texture_mean 0.42 0.32 1.00 0.33 0.32 -0.02 0.24 0.30 0.29 0.07 ... 0.35 0.91 0.36 0.34 0.08 0.28 0.30 0.30 0.11 0.12
perimeter_mean 0.74 1.00 0.33 1.00 0.99 0.21 0.56 0.72 0.85 0.18 ... 0.97 0.30 0.97 0.94 0.15 0.46 0.56 0.77 0.19 0.05
area_mean 0.71 0.99 0.32 0.99 1.00 0.18 0.50 0.69 0.82 0.15 ... 0.96 0.29 0.96 0.96 0.12 0.39 0.51 0.72 0.14 0.00
smoothness_mean 0.36 0.17 -0.02 0.21 0.18 1.00 0.66 0.52 0.55 0.56 ... 0.21 0.04 0.24 0.21 0.81 0.47 0.43 0.50 0.39 0.50
compactness_mean 0.60 0.51 0.24 0.56 0.50 0.66 1.00 0.88 0.83 0.60 ... 0.54 0.25 0.59 0.51 0.57 0.87 0.82 0.82 0.51 0.69
concavity_mean 0.70 0.68 0.30 0.72 0.69 0.52 0.88 1.00 0.92 0.50 ... 0.69 0.30 0.73 0.68 0.45 0.75 0.88 0.86 0.41 0.51
concave points_mean 0.78 0.82 0.29 0.85 0.82 0.55 0.83 0.92 1.00 0.46 ... 0.83 0.29 0.86 0.81 0.45 0.67 0.75 0.91 0.38 0.37
symmetry_mean 0.33 0.15 0.07 0.18 0.15 0.56 0.60 0.50 0.46 1.00 ... 0.19 0.09 0.22 0.18 0.43 0.47 0.43 0.43 0.70 0.44
fractal_dimension_mean -0.01 -0.31 -0.08 -0.26 -0.28 0.58 0.57 0.34 0.17 0.48 ... -0.25 -0.05 -0.21 -0.23 0.50 0.46 0.35 0.18 0.33 0.77
radius_se 0.57 0.68 0.28 0.69 0.73 0.30 0.50 0.63 0.70 0.30 ... 0.72 0.19 0.72 0.75 0.14 0.29 0.38 0.53 0.09 0.05
texture_se -0.01 -0.10 0.39 -0.09 -0.07 0.07 0.05 0.08 0.02 0.13 ... -0.11 0.41 -0.10 -0.08 -0.07 -0.09 -0.07 -0.12 -0.13 -0.05
perimeter_se 0.56 0.67 0.28 0.69 0.73 0.30 0.55 0.66 0.71 0.31 ... 0.70 0.20 0.72 0.73 0.13 0.34 0.42 0.55 0.11 0.09
area_se 0.55 0.74 0.26 0.74 0.80 0.25 0.46 0.62 0.69 0.22 ... 0.76 0.20 0.76 0.81 0.13 0.28 0.39 0.54 0.07 0.02
smoothness_se -0.07 -0.22 0.01 -0.20 -0.17 0.33 0.14 0.10 0.03 0.19 ... -0.23 -0.07 -0.22 -0.18 0.31 -0.06 -0.06 -0.10 -0.11 0.10
compactness_se 0.29 0.21 0.19 0.25 0.21 0.32 0.74 0.67 0.49 0.42 ... 0.20 0.14 0.26 0.20 0.23 0.68 0.64 0.48 0.28 0.59
concavity_se 0.25 0.19 0.14 0.23 0.21 0.25 0.57 0.69 0.44 0.34 ... 0.19 0.10 0.23 0.19 0.17 0.48 0.66 0.44 0.20 0.44
concave points_se 0.41 0.38 0.16 0.41 0.37 0.38 0.64 0.68 0.62 0.39 ... 0.36 0.09 0.39 0.34 0.22 0.45 0.55 0.60 0.14 0.31
symmetry_se -0.01 -0.10 0.01 -0.08 -0.07 0.20 0.23 0.18 0.10 0.45 ... -0.13 -0.08 -0.10 -0.11 -0.01 0.06 0.04 -0.03 0.39 0.08
fractal_dimension_se 0.08 -0.04 0.05 -0.01 -0.02 0.28 0.51 0.45 0.26 0.33 ... -0.04 -0.00 -0.00 -0.02 0.17 0.39 0.38 0.22 0.11 0.59
radius_worst 0.78 0.97 0.35 0.97 0.96 0.21 0.54 0.69 0.83 0.19 ... 1.00 0.36 0.99 0.98 0.22 0.48 0.57 0.79 0.24 0.09
texture_worst 0.46 0.30 0.91 0.30 0.29 0.04 0.25 0.30 0.29 0.09 ... 0.36 1.00 0.37 0.35 0.23 0.36 0.37 0.36 0.23 0.22
perimeter_worst 0.78 0.97 0.36 0.97 0.96 0.24 0.59 0.73 0.86 0.22 ... 0.99 0.37 1.00 0.98 0.24 0.53 0.62 0.82 0.27 0.14
area_worst 0.73 0.94 0.34 0.94 0.96 0.21 0.51 0.68 0.81 0.18 ... 0.98 0.35 0.98 1.00 0.21 0.44 0.54 0.75 0.21 0.08
smoothness_worst 0.42 0.12 0.08 0.15 0.12 0.81 0.57 0.45 0.45 0.43 ... 0.22 0.23 0.24 0.21 1.00 0.57 0.52 0.55 0.49 0.62
compactness_worst 0.59 0.41 0.28 0.46 0.39 0.47 0.87 0.75 0.67 0.47 ... 0.48 0.36 0.53 0.44 0.57 1.00 0.89 0.80 0.61 0.81
concavity_worst 0.66 0.53 0.30 0.56 0.51 0.43 0.82 0.88 0.75 0.43 ... 0.57 0.37 0.62 0.54 0.52 0.89 1.00 0.86 0.53 0.69
concave points_worst 0.79 0.74 0.30 0.77 0.72 0.50 0.82 0.86 0.91 0.43 ... 0.79 0.36 0.82 0.75 0.55 0.80 0.86 1.00 0.50 0.51
symmetry_worst 0.42 0.16 0.11 0.19 0.14 0.39 0.51 0.41 0.38 0.70 ... 0.24 0.23 0.27 0.21 0.49 0.61 0.53 0.50 1.00 0.54
fractal_dimension_worst 0.32 0.01 0.12 0.05 0.00 0.50 0.69 0.51 0.37 0.44 ... 0.09 0.22 0.14 0.08 0.62 0.81 0.69 0.51 0.54 1.00

31 rows × 31 columns

In [25]:
# Pearson's Correlation Heatmap

plt.figure(figsize=(50,35))
sns.set(font_scale= 1.8)
plt.rcParams["axes.labelsize"] = 10
sns.heatmap(df.corr(), annot=True);
plt.show();
In [26]:
# Data Cleaning
vars_to_remove = [
    # features with pairwise correlation above 0.95 are removed to reduce multicollinearity
    'perimeter_mean', 'area_mean', 'perimeter_worst', 'area_worst', 'radius_worst', 'radius_se'
]
df = df.drop(columns=vars_to_remove)
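The drop list above was read off the heatmap by hand. As a sketch (the helper name `high_corr_pairs` and the toy threshold handling are illustrative, not part of the original notebook), the same pairs can be found programmatically:

```python
import numpy as np
import pandas as pd

def high_corr_pairs(frame, threshold=0.95):
    """Return feature pairs whose absolute Pearson correlation exceeds threshold."""
    corr = frame.corr(numeric_only=True).abs()
    # keep only the upper triangle so each pair appears once and the diagonal is dropped
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return upper.stack().loc[lambda s: s > threshold].sort_values(ascending=False)
```

From each reported pair, the member that is harder to interpret can then be chosen as the drop candidate.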
In [27]:
# Creates a histogram using the values from the "radius_mean" column of the DataFrame 
plt.hist(df["radius_mean"], color = 'b');
plt.show();
In [28]:
# Creates a distribution plot of the "radius_mean" column from the DataFrame
sns.histplot(df["radius_mean"], kde=True, color='r');  # histplot replaces the deprecated distplot
sns.rugplot(df["radius_mean"], color='r');             # rug marks along the x-axis
plt.show();
In [29]:
# Pearson's Correlation Heatmap

plt.figure(figsize=(50,35))
sns.set(font_scale= 1.8)
plt.rcParams["axes.labelsize"] = 0.8
sns.heatmap(df.corr(), annot=True, cmap = "coolwarm");
plt.show();
In [30]:
# Spearman's Rank or Spearman's Rho correlation
plt.figure(figsize=(50,35))
sns.set(font_scale= 1.8)
plt.rcParams["axes.labelsize"] = 0.8
sns.heatmap(df.corr(method='spearman'), annot=True, cmap="coolwarm"); # nonparametric correlation
plt.show()
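A quick sketch of why both coefficients are worth inspecting: Spearman's rho works on ranks, so any monotone relationship scores a perfect 1.0 even when Pearson's r does not (toy values below, purely illustrative):

```python
import pandas as pd

toy = pd.DataFrame({'x': [1, 2, 3, 4, 5]})
toy['y'] = toy['x'] ** 3  # monotone but clearly nonlinear: 1, 8, 27, 64, 125

pearson = toy['x'].corr(toy['y'])                      # below 1: relationship is not linear
spearman = toy['x'].corr(toy['y'], method='spearman')  # 1.0: the ranks agree perfectly
```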
In [31]:
# Phi K correlations for all variables
! pip install phik
import phik
from phik import resources, report
Requirement already satisfied: phik in /opt/anaconda3/lib/python3.12/site-packages (0.12.4)
In [32]:
df.phik_matrix()
interval columns not set, guessing: ['diagnosis', 'radius_mean', 'texture_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean', 'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se', 'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se', 'fractal_dimension_se', 'texture_worst', 'smoothness_worst', 'compactness_worst', 'concavity_worst', 'concave points_worst', 'symmetry_worst', 'fractal_dimension_worst']
Out[32]:
diagnosis radius_mean texture_mean smoothness_mean compactness_mean concavity_mean concave points_mean symmetry_mean fractal_dimension_mean texture_se ... concave points_se symmetry_se fractal_dimension_se texture_worst smoothness_worst compactness_worst concavity_worst concave points_worst symmetry_worst fractal_dimension_worst
diagnosis 1.00 0.93 0.61 0.47 0.78 0.92 0.96 0.43 0.21 0.10 ... 0.46 0.21 0.26 0.61 0.55 0.76 0.90 0.97 0.53 0.32
radius_mean 0.93 1.00 0.30 0.41 0.59 0.74 0.85 0.32 0.47 0.36 ... 0.35 0.36 0.21 0.30 0.23 0.52 0.61 0.76 0.16 0.19
texture_mean 0.61 0.30 1.00 0.00 0.32 0.40 0.36 0.16 0.00 0.36 ... 0.16 0.00 0.00 0.93 0.20 0.42 0.48 0.41 0.00 0.00
smoothness_mean 0.47 0.41 0.00 1.00 0.78 0.60 0.60 0.65 0.72 0.50 ... 0.37 0.44 0.34 0.00 0.82 0.53 0.46 0.53 0.68 0.57
compactness_mean 0.78 0.59 0.32 0.78 1.00 0.88 0.82 0.78 0.66 0.38 ... 0.59 0.69 0.48 0.18 0.67 0.86 0.86 0.83 0.73 0.62
concavity_mean 0.92 0.74 0.40 0.60 0.88 1.00 0.92 0.65 0.52 0.42 ... 0.63 0.45 0.53 0.36 0.50 0.76 0.86 0.84 0.53 0.45
concave points_mean 0.96 0.85 0.36 0.60 0.82 0.92 1.00 0.53 0.29 0.32 ... 0.57 0.28 0.26 0.35 0.47 0.69 0.77 0.90 0.50 0.34
symmetry_mean 0.43 0.32 0.16 0.65 0.78 0.65 0.53 1.00 0.59 0.45 ... 0.43 0.61 0.37 0.17 0.50 0.55 0.50 0.54 0.82 0.41
fractal_dimension_mean 0.21 0.47 0.00 0.72 0.66 0.52 0.29 0.59 1.00 0.15 ... 0.46 0.44 0.71 0.19 0.67 0.57 0.56 0.22 0.63 0.68
texture_se 0.10 0.36 0.36 0.50 0.38 0.42 0.32 0.45 0.15 1.00 ... 0.57 0.65 0.43 0.38 0.35 0.00 0.00 0.18 0.20 0.00
perimeter_se 0.58 0.71 0.27 0.30 0.54 0.67 0.71 0.39 0.20 0.34 ... 0.52 0.44 0.28 0.22 0.08 0.32 0.40 0.52 0.17 0.10
area_se 0.79 0.79 0.31 0.44 0.50 0.67 0.69 0.37 0.15 0.55 ... 0.43 0.38 0.20 0.33 0.09 0.29 0.40 0.52 0.13 0.00
smoothness_se 0.10 0.47 0.00 0.52 0.36 0.41 0.31 0.44 0.52 0.82 ... 0.69 0.69 0.58 0.00 0.53 0.00 0.00 0.19 0.00 0.17
compactness_se 0.34 0.28 0.19 0.41 0.64 0.65 0.45 0.44 0.55 0.55 ... 0.83 0.64 0.74 0.20 0.32 0.66 0.63 0.41 0.51 0.73
concavity_se 0.40 0.32 0.07 0.27 0.49 0.68 0.43 0.42 0.57 0.47 ... 0.85 0.44 0.81 0.00 0.07 0.37 0.69 0.34 0.29 0.40
concave points_se 0.46 0.35 0.16 0.37 0.59 0.63 0.57 0.43 0.46 0.57 ... 1.00 0.52 0.78 0.12 0.26 0.41 0.56 0.61 0.17 0.36
symmetry_se 0.21 0.36 0.00 0.44 0.69 0.45 0.28 0.61 0.44 0.65 ... 0.52 1.00 0.42 0.00 0.32 0.37 0.36 0.25 0.68 0.66
fractal_dimension_se 0.26 0.21 0.00 0.34 0.48 0.53 0.26 0.37 0.71 0.43 ... 0.78 0.42 1.00 0.04 0.27 0.39 0.46 0.20 0.16 0.53
texture_worst 0.61 0.30 0.93 0.00 0.18 0.36 0.35 0.17 0.19 0.38 ... 0.12 0.00 0.04 1.00 0.31 0.53 0.57 0.41 0.21 0.27
smoothness_worst 0.55 0.23 0.20 0.82 0.67 0.50 0.47 0.50 0.67 0.35 ... 0.26 0.32 0.27 0.31 1.00 0.69 0.57 0.59 0.69 0.63
compactness_worst 0.76 0.52 0.42 0.53 0.86 0.76 0.69 0.55 0.57 0.00 ... 0.41 0.37 0.39 0.53 0.69 1.00 0.93 0.79 0.75 0.80
concavity_worst 0.90 0.61 0.48 0.46 0.86 0.86 0.77 0.50 0.56 0.00 ... 0.56 0.36 0.46 0.57 0.57 0.93 1.00 0.83 0.66 0.74
concave points_worst 0.97 0.76 0.41 0.53 0.83 0.84 0.90 0.54 0.22 0.18 ... 0.61 0.25 0.20 0.41 0.59 0.79 0.83 1.00 0.53 0.41
symmetry_worst 0.53 0.16 0.00 0.68 0.73 0.53 0.50 0.82 0.63 0.20 ... 0.17 0.68 0.16 0.21 0.69 0.75 0.66 0.53 1.00 0.71
fractal_dimension_worst 0.32 0.19 0.00 0.57 0.62 0.45 0.34 0.41 0.68 0.00 ... 0.36 0.66 0.53 0.27 0.63 0.80 0.74 0.41 0.71 1.00

25 rows × 25 columns

In [33]:
from phik.report import plot_correlation_matrix
phik_overview = df.phik_matrix()
interval columns not set, guessing: ['diagnosis', 'radius_mean', 'texture_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean', 'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se', 'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se', 'fractal_dimension_se', 'texture_worst', 'smoothness_worst', 'compactness_worst', 'concavity_worst', 'concave points_worst', 'symmetry_worst', 'fractal_dimension_worst']
In [34]:
# Presenting Phi K data as a heatmap (credit: Jai Gupta, Stanford SPCS 2023)
plot_correlation_matrix(phik_overview.values, 
                        x_labels=phik_overview.columns, 
                        y_labels=phik_overview.index, 
                        vmin=0, vmax=1, color_map="coolwarm", 
                        title=r"correlation $\phi_K$", 
                        fontsize_factor=1.8, 
                        figsize=(40, 32))
plt.tight_layout()
plt.show()
In [35]:
# Creates a violin plot using the Seaborn library
sns.violinplot(df['concave points_worst'],color='y');
plt.show()
In [36]:
sns.violinplot(x=df['concave points_worst'],color='y'); # Note inclusion of "x=" to rotate the plot
plt.show()
In [37]:
# Function to create plot for categorical variable
# Annotate barplot
def annotate_bars(ax, feature):
    total = len(feature)
    for p in ax.patches:
        percentage = '{:.2f}%'.format(100 * p.get_height() / total)
        x = p.get_x() + p.get_width() / 2
        y = p.get_height()
        ax.annotate(percentage, (x, y), ha='center', va='bottom', size=12)

# Create figure and axis
fig, ax = plt.subplots(figsize=(10, 6))
sns.countplot(data=df, x='diagnosis', palette='winter', ax=ax)
annotate_bars(ax, df['diagnosis'])
plt.tight_layout()
plt.show()
In [38]:
# From UT Austin Computer Science Department
# Used this function to create a combo boxplot and histogram for continuous (I/R --> int64 and float64) variables

from scipy.stats import norm  # needed by fit=norm below

def boxplot_histogram(feature, figsize=(10,7), bins=None):
    sns.set(font_scale=2)
    f2, (ax_box2, ax_hist2) = plt.subplots(nrows=2,
                                           sharex=True,
                                           gridspec_kw={"height_ratios": (.25, .75)},
                                           figsize=figsize
                                           )
    sns.boxplot(feature, ax=ax_box2, orient="h", showmeans=True, color='red') # mean value will be noted
    # Note: distplot is deprecated in newer seaborn; histplot/displot are the modern equivalents
    if bins:
        sns.distplot(feature, kde=False, ax=ax_hist2, bins=bins)
    else:
        sns.distplot(feature, kde=False, ax=ax_hist2, fit=norm)
    ax_hist2.axvline(np.mean(feature), color='g', linestyle='--')       # Add mean to the histogram
    ax_hist2.axvline(np.median(feature), color='black', linestyle='-')  # Add median to the histogram
    plt.axvline(feature.mode()[0], color='r', linestyle='dashed', linewidth=1); # Add mode to the histogram
In [39]:
boxplot_histogram(df.radius_mean)
plt.show()
In [40]:
boxplot_histogram(df.concavity_mean)
plt.show()
In [41]:
boxplot_histogram(df["concave points_worst"])
plt.show()
In [42]:
# Plot histograms to check the distribution of each numeric variable
from scipy.stats import norm
all_col = df.select_dtypes(include=np.number).columns.tolist()
plt.figure(figsize=(17,75))

for i in range(len(all_col)):
    plt.subplot(18,3,i+1)
    plt.hist(df[all_col[i]])
    plt.tight_layout()
    plt.title(all_col[i],fontsize=25)
    
plt.show()
In [43]:
# Outlier detection using boxplots (for all I/R variables)
plt.figure(figsize=(20,30))

for i, variable in enumerate(df):  # enumerate keeps a running count of the loop iterations
    plt.subplot(8, 4, i+1)         # place each boxplot on a shared figure
    plt.boxplot(df[variable], whis=1.5)
    plt.tight_layout()
    plt.title(variable)

plt.show()
In [44]:
# Use flooring and capping method
def treat_outliers(df,col):
    Q1=df[col].quantile(0.25) 
    Q3=df[col].quantile(0.75) 
    IQR=Q3-Q1
    Lower_Whisker = Q1 - 1.5*IQR 
    Upper_Whisker = Q3 + 1.5*IQR
    df[col] = np.clip(df[col], Lower_Whisker, Upper_Whisker)                                                            
    return df

def treat_outliers_all(df, col_list):
    for c in col_list:
        df = treat_outliers(df,c)
    return df

numerical_col = df.select_dtypes(include=np.number).columns.tolist()
df = treat_outliers_all(df,numerical_col)

plt.figure(figsize=(20,30))

for i, variable in enumerate(numerical_col):
    plt.subplot(8, 4, i+1)
    plt.boxplot(df[variable], whis=1.5)
    plt.tight_layout()
    plt.title(variable)

plt.show()
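A tiny worked example of the flooring/capping rule above, on made-up numbers: with Q1 = 2 and Q3 = 4, the whiskers sit at -1 and 7, so the outlier 100 is capped down to 7.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, 4, 100])               # 100 is an obvious outlier
q1, q3 = s.quantile(0.25), s.quantile(0.75)    # 2.0 and 4.0
iqr = q3 - q1                                  # 2.0
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # -1.0 and 7.0
capped = np.clip(s, lower, upper)              # values become 1, 2, 3, 4, 7
```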

Bivariate Analyses¶

In [45]:
# Boxplot with DV and 1 IV
# sns.boxplot(x = "categorical_var", y = "numeric_var", data = df)
# plt.title('graph_title')
# plt.show()

plt.figure(figsize=(10,7))
sns.boxplot(x = "diagnosis", y = "concave points_worst", data = df)
plt.title('Boxplot for diagnosis vs. concave points_worst')
plt.show()
In [46]:
ttest_boxplot = df.boxplot(column='radius_mean', by='diagnosis', figsize=(10, 6), grid=False);
ttest_boxplot.set_title('');
ttest_boxplot.set_ylabel('');
plt.show()
In [47]:
# Catplot
sns.catplot(x="diagnosis", y="concave points_worst", data=df, kind='boxen', height=6, aspect=1.6);
plt.xlabel('diagnosis', fontsize=15);
plt.title('Catplot for concave points_worst vs. diagnosis')
plt.ylabel('concave points_worst', fontsize=15);
plt.show()
In [48]:
plt.figure(figsize=(10,6))
sns.scatterplot(x='smoothness_mean', y='radius_mean', hue='diagnosis', data=df, palette='Set2')
plt.show()
In [49]:
plt.figure(figsize=(10,6))
sns.scatterplot(x='smoothness_mean', y='concave points_worst', hue='diagnosis', data=df, palette='Set2')
plt.show()
In [50]:
# View Outcome column
df.diagnosis
Out[50]:
0      1
1      1
2      1
3      1
4      1
      ..
564    1
565    1
566    1
567    1
568    0
Name: diagnosis, Length: 569, dtype: int64
In [51]:
# Map 0's and 1's to 'benign' and 'malignant', respectively
df['diagnosis'] = df['diagnosis'].map({0: 'benign', 1: 'malignant'})
In [52]:
# After running the code above, 0's have become 'benign' and 1's 'malignant'
df['diagnosis']
Out[52]:
0      malignant
1      malignant
2      malignant
3      malignant
4      malignant
         ...    
564    malignant
565    malignant
566    malignant
567    malignant
568       benign
Name: diagnosis, Length: 569, dtype: object
In [53]:
# Now notice with this stripplot that the x-axis has words instead of 0 and 1
plt.figure(figsize=(10,6))
sns.stripplot(data=df, x='diagnosis', y='area_se', jitter=True);
plt.show();
In [54]:
# Map the string labels back to numbers
df['diagnosis'] = df['diagnosis'].map({'benign': 0, 'malignant': 1})
In [55]:
# Rerun the stripplot to see the x-axis labels are now 0 and 1 again
plt.figure(figsize=(10,6))
sns.stripplot(data=df, x='diagnosis', y='concavity_mean', jitter=True);
plt.show()
In [56]:
# Create swarm plot
plt.figure(figsize=(10,6))
sns.swarmplot(data=df, x='diagnosis', y='radius_mean');
plt.show();

One-Hot Encoding¶

In [57]:
# Converting categorical variables into binary vectors, where each category becomes a new feature with values of 0 or 1

# Let's create a fake dataframe simply looking at colors:

df_example = pd.DataFrame({'Color': ['Red', 'Blue', 'Green', 'Red', 'Blue']})
df_example
Out[57]:
Color
0 Red
1 Blue
2 Green
3 Red
4 Blue
In [58]:
# Import OneHot encoder
from sklearn.preprocessing import OneHotEncoder

# Initialize One-HotEncoder
encoder = OneHotEncoder()

# Perform encoding
encoded_df = encoder.fit_transform(df_example[['Color']])

# Convert encoded data to a pandas DataFrame
encoded_df = pd.DataFrame(encoded_df.toarray(), columns=encoder.get_feature_names_out(['Color']))

# Concatenate the original data with the encoded data
data_encoded = pd.concat([df_example, encoded_df], axis=1)

print(data_encoded)
   Color  Color_Blue  Color_Green  Color_Red
0    Red        0.00         0.00       1.00
1   Blue        1.00         0.00       0.00
2  Green        0.00         1.00       0.00
3    Red        0.00         0.00       1.00
4   Blue        1.00         0.00       0.00
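For quick exploration, pandas' built-in `get_dummies` produces the same columns without a fitted encoder object; `OneHotEncoder` remains preferable inside a modeling pipeline because it remembers the category set for transforming unseen data. A minimal sketch:

```python
import pandas as pd

df_example = pd.DataFrame({'Color': ['Red', 'Blue', 'Green', 'Red', 'Blue']})

# One binary column per category, named Color_<category>
dummies = pd.get_dummies(df_example['Color'], prefix='Color', dtype=int)
data_encoded = pd.concat([df_example, dummies], axis=1)
```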

Supervised Machine Learning (Binary Classification)¶

In [59]:
# Make a copy of dataframe
data=df.copy()
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 25 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   diagnosis                569 non-null    int64  
 1   radius_mean              569 non-null    float64
 2   texture_mean             569 non-null    float64
 3   smoothness_mean          569 non-null    float64
 4   compactness_mean         569 non-null    float64
 5   concavity_mean           569 non-null    float64
 6   concave points_mean      569 non-null    float64
 7   symmetry_mean            569 non-null    float64
 8   fractal_dimension_mean   569 non-null    float64
 9   texture_se               569 non-null    float64
 10  perimeter_se             569 non-null    float64
 11  area_se                  569 non-null    float64
 12  smoothness_se            569 non-null    float64
 13  compactness_se           569 non-null    float64
 14  concavity_se             569 non-null    float64
 15  concave points_se        569 non-null    float64
 16  symmetry_se              569 non-null    float64
 17  fractal_dimension_se     569 non-null    float64
 18  texture_worst            569 non-null    float64
 19  smoothness_worst         569 non-null    float64
 20  compactness_worst        569 non-null    float64
 21  concavity_worst          569 non-null    float64
 22  concave points_worst     569 non-null    float64
 23  symmetry_worst           569 non-null    float64
 24  fractal_dimension_worst  569 non-null    float64
dtypes: float64(24), int64(1)
memory usage: 111.3 KB
In [60]:
# Libraries for different ML classifiers

from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn import tree

# Libraries for model tuning and evaluation metrics
from sklearn import metrics
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, precision_score, recall_score
from sklearn.model_selection import GridSearchCV

Decision Tree¶

In [61]:
# Separate the target variable from the predictors

X = data.drop('diagnosis', axis=1)
y = data['diagnosis'].astype('int64')

# .astype('int64') converts the target to integers, since some functions do not work with the bool type
In [62]:
# Split the data into training and test sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
print(X_train.shape, X_test.shape)
(398, 24) (171, 24)
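One optional refinement, not used above: with an imbalanced target, passing `stratify=y` keeps the benign/malignant ratio nearly identical across the two splits. A sketch on synthetic data (the `make_classification` settings are illustrative only):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# ~80/20 imbalanced binary problem
X_demo, y_demo = make_classification(n_samples=100, weights=[0.8], random_state=1)

# stratify preserves the class proportions in both partitions
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.3,
                                      random_state=1, stratify=y_demo)
```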

Initial model using the Decision Tree Classifier¶

In [63]:
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(criterion='gini',class_weight={0:0.15,1:0.85},random_state=1)

# Two commonly used splitting criteria are Gini impurity and information gain (entropy)
# Gini: measures the probability of misclassifying a randomly chosen element if it were randomly labeled
    # Should the goal be to minimize or maximize the Gini impurity when making splits?
        # MINIMIZE

# Information gain (entropy): entropy measures impurity or uncertainty, while information gain quantifies the reduction in entropy
    # Which do we want to minimize? Maximize?
        # MINIMIZE entropy
        # MAXIMIZE information gain
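As a worked illustration of both criteria (toy numbers, not from this dataset): for a node holding 3 benign and 1 malignant sample,

```python
import numpy as np

p = np.array([3 / 4, 1 / 4])       # class proportions at the node
gini = 1 - np.sum(p ** 2)          # 1 - (0.5625 + 0.0625) = 0.375
entropy = -np.sum(p * np.log2(p))  # about 0.811 bits
```

A pure node (p = [1, 0]) scores 0 on both measures, which is why splits aim to drive these values down.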
In [64]:
model.fit(X_train, y_train)
Out[64]:
DecisionTreeClassifier(class_weight={0: 0.15, 1: 0.85}, random_state=1)
In [65]:
def make_confusion_matrix(model, y_actual):
    # Predicts on X_test (taken from the enclosing scope) and plots an annotated confusion matrix
    y_predict = model.predict(X_test)
    cm = metrics.confusion_matrix(y_actual, y_predict, labels=[0, 1])
    df_cm = pd.DataFrame(cm, index=["Actual - No", "Actual - Yes"],
                         columns=['Predicted - No', 'Predicted - Yes'])
    group_counts = ["{0:0.0f}".format(value) for value in cm.flatten()]
    group_percentages = ["{0:.2%}".format(value) for value in cm.flatten() / np.sum(cm)]
    labels = [f"{v1}\n{v2}" for v1, v2 in zip(group_counts, group_percentages)]
    labels = np.asarray(labels).reshape(2, 2)
    plt.figure(figsize=(10, 7))
    sns.heatmap(df_cm, annot=labels, fmt='')
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
In [66]:
make_confusion_matrix(model,y_test)
plt.show()
In [67]:
y_train.value_counts(1)
Out[67]:
diagnosis
0   0.63
1   0.37
Name: proportion, dtype: float64
In [68]:
column_names = list(data.columns)
column_names.remove('diagnosis')  # As this is the DV                
feature_names = column_names
print(feature_names)
['radius_mean', 'texture_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean', 'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se', 'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se', 'fractal_dimension_se', 'texture_worst', 'smoothness_worst', 'compactness_worst', 'concavity_worst', 'concave points_worst', 'symmetry_worst', 'fractal_dimension_worst']
In [69]:
from sklearn.model_selection import train_test_split
from sklearn import tree
from sklearn import metrics

plt.figure(figsize=(20,30))
out = tree.plot_tree(model,feature_names=feature_names,filled=True,fontsize=9,node_ids=False,class_names=None,)
# Code below will add arrows to the decision tree split if they are missing
for o in out:
     arrow = o.arrow_patch
     if arrow is not None:
        arrow.set_edgecolor('black')
        arrow.set_linewidth(1)
plt.show()
In [70]:
# Text report showing the rules of the decision tree

print(tree.export_text(model,feature_names=feature_names,show_weights=True))
|--- concave points_worst <= 0.11
|   |--- radius_mean <= 15.44
|   |   |--- area_se <= 48.98
|   |   |   |--- weights: [31.50, 0.00] class: 0
|   |   |--- area_se >  48.98
|   |   |   |--- smoothness_worst <= 0.11
|   |   |   |   |--- weights: [0.30, 0.00] class: 0
|   |   |   |--- smoothness_worst >  0.11
|   |   |   |   |--- weights: [0.00, 0.85] class: 1
|   |--- radius_mean >  15.44
|   |   |--- concavity_mean <= 0.04
|   |   |   |--- symmetry_se <= 0.01
|   |   |   |   |--- weights: [0.00, 0.85] class: 1
|   |   |   |--- symmetry_se >  0.01
|   |   |   |   |--- weights: [0.00, 2.55] class: 1
|   |   |--- concavity_mean >  0.04
|   |   |   |--- weights: [0.30, 0.00] class: 0
|--- concave points_worst >  0.11
|   |--- concave points_mean <= 0.05
|   |   |--- compactness_se <= 0.02
|   |   |   |--- texture_worst <= 19.91
|   |   |   |   |--- weights: [0.45, 0.00] class: 0
|   |   |   |--- texture_worst >  19.91
|   |   |   |   |--- concavity_mean <= 0.05
|   |   |   |   |   |--- weights: [0.15, 0.00] class: 0
|   |   |   |   |--- concavity_mean >  0.05
|   |   |   |   |   |--- weights: [0.00, 5.10] class: 1
|   |   |--- compactness_se >  0.02
|   |   |   |--- smoothness_worst <= 0.18
|   |   |   |   |--- weights: [2.70, 0.00] class: 0
|   |   |   |--- smoothness_worst >  0.18
|   |   |   |   |--- weights: [0.00, 0.85] class: 1
|   |--- concave points_mean >  0.05
|   |   |--- texture_mean <= 14.16
|   |   |   |--- weights: [0.45, 0.00] class: 0
|   |   |--- texture_mean >  14.16
|   |   |   |--- concavity_worst <= 0.22
|   |   |   |   |--- weights: [0.15, 0.00] class: 0
|   |   |   |--- concavity_worst >  0.22
|   |   |   |   |--- radius_mean <= 10.41
|   |   |   |   |   |--- weights: [0.15, 0.00] class: 0
|   |   |   |   |--- radius_mean >  10.41
|   |   |   |   |   |--- area_se <= 13.47
|   |   |   |   |   |   |--- weights: [0.15, 0.00] class: 0
|   |   |   |   |   |--- area_se >  13.47
|   |   |   |   |   |   |--- texture_worst <= 18.35
|   |   |   |   |   |   |   |--- weights: [0.15, 0.00] class: 0
|   |   |   |   |   |   |--- texture_worst >  18.35
|   |   |   |   |   |   |   |--- smoothness_worst <= 0.10
|   |   |   |   |   |   |   |   |--- weights: [0.15, 0.00] class: 0
|   |   |   |   |   |   |   |--- smoothness_worst >  0.10
|   |   |   |   |   |   |   |   |--- concave points_mean <= 0.06
|   |   |   |   |   |   |   |   |   |--- radius_mean <= 13.14
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.45, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |--- radius_mean >  13.14
|   |   |   |   |   |   |   |   |   |   |--- texture_mean <= 15.49
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [0.15, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |   |--- texture_mean >  15.49
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 3
|   |   |   |   |   |   |   |   |--- concave points_mean >  0.06
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 107.10] class: 1

In [71]:
# Importance of features in the tree building (The importance of a feature is computed as the 
# (normalized) total reduction of the criterion brought by that feature. It is also known as the Gini importance)

print (pd.DataFrame(model.feature_importances_, columns = ["Imp"], index = X_train.columns).sort_values(by = 'Imp', ascending = False))
                         Imp
concave points_worst    0.70
radius_mean             0.11
concave points_mean     0.04
smoothness_worst        0.04
compactness_se          0.03
area_se                 0.03
texture_mean            0.02
texture_worst           0.02
concavity_mean          0.01
concavity_worst         0.01
symmetry_worst          0.00
perimeter_se            0.00
symmetry_se             0.00
smoothness_mean         0.00
compactness_worst       0.00
compactness_mean        0.00
concave points_se       0.00
fractal_dimension_se    0.00
symmetry_mean           0.00
concavity_se            0.00
smoothness_se           0.00
texture_se              0.00
fractal_dimension_mean  0.00
fractal_dimension_worst 0.00
In [72]:
# Example of Feature Importance

importances = model.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(20,15))
plt.title('Feature Importance')
plt.barh(range(len(indices)), importances[indices], color='blue', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
In [73]:
from sklearn.model_selection import GridSearchCV
In [74]:
# Choose the type of classifier
estimator = DecisionTreeClassifier(random_state=1, class_weight={0: .15, 1: .85})
# random_state controls the randomness inside the estimator (e.g., the order in which candidate features are considered)
# Grid of parameters to choose from
parameters = {
            'max_depth': np.arange(15,27),
            'criterion': ['entropy','gini'],
            'splitter': ['best','random'],
            'min_impurity_decrease': [0.0001, 0.001],
            'max_features': ['log2','sqrt']
             }

# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)

# Run the grid search
grid_obj = GridSearchCV(estimator, parameters, scoring=scorer,cv=5)
grid_obj = grid_obj.fit(X_train, y_train)

# Set the classifier to the best combination of parameters
estimator = grid_obj.best_estimator_

# Fit the best algorithm to the data
estimator.fit(X_train, y_train)
Out[74]:
DecisionTreeClassifier(class_weight={0: 0.15, 1: 0.85}, max_depth=15,
                       max_features='log2', min_impurity_decrease=0.001,
                       random_state=1)
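Recall is used as the grid-search metric because a missed malignant tumor (false negative) is far costlier than a false alarm. A toy sketch (made-up labels) of why accuracy alone can mislead on imbalanced classes:

```python
from sklearn.metrics import accuracy_score, recall_score

y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]  # 2 malignant cases out of 10
y_pred = [0] * 10                        # a model that always predicts benign

acc = accuracy_score(y_true, y_pred)     # 0.8 -- looks respectable
rec = recall_score(y_true, y_pred)       # 0.0 -- both malignant cases were missed
```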
In [75]:
make_confusion_matrix(estimator,y_test)
In [76]:
plt.figure(figsize=(15,10))
out = tree.plot_tree(estimator,feature_names=feature_names,filled=True,fontsize=9,node_ids=True,class_names=None)
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor('black')
        arrow.set_linewidth(1)
plt.show()
In [77]:
print(tree.export_text(estimator,feature_names=feature_names,show_weights=False))
|--- concave points_worst <= 0.11
|   |--- fractal_dimension_se <= 0.00
|   |   |--- concavity_worst <= 0.12
|   |   |   |--- class: 0
|   |   |--- concavity_worst >  0.12
|   |   |   |--- radius_mean <= 15.44
|   |   |   |   |--- class: 0
|   |   |   |--- radius_mean >  15.44
|   |   |   |   |--- class: 1
|   |--- fractal_dimension_se >  0.00
|   |   |--- compactness_se <= 0.01
|   |   |   |--- compactness_mean <= 0.06
|   |   |   |   |--- class: 0
|   |   |   |--- compactness_mean >  0.06
|   |   |   |   |--- smoothness_mean <= 0.10
|   |   |   |   |   |--- fractal_dimension_se <= 0.00
|   |   |   |   |   |   |--- class: 0
|   |   |   |   |   |--- fractal_dimension_se >  0.00
|   |   |   |   |   |   |--- class: 1
|   |   |   |   |--- smoothness_mean >  0.10
|   |   |   |   |   |--- class: 0
|   |   |--- compactness_se >  0.01
|   |   |   |--- class: 0
|--- concave points_worst >  0.11
|   |--- texture_worst <= 18.39
|   |   |--- class: 0
|   |--- texture_worst >  18.39
|   |   |--- radius_mean <= 12.65
|   |   |   |--- concave points_se <= 0.01
|   |   |   |   |--- fractal_dimension_worst <= 0.11
|   |   |   |   |   |--- class: 0
|   |   |   |   |--- fractal_dimension_worst >  0.11
|   |   |   |   |   |--- class: 1
|   |   |   |--- concave points_se >  0.01
|   |   |   |   |--- class: 0
|   |   |--- radius_mean >  12.65
|   |   |   |--- concavity_worst <= 0.22
|   |   |   |   |--- class: 0
|   |   |   |--- concavity_worst >  0.22
|   |   |   |   |--- concavity_mean <= 0.05
|   |   |   |   |   |--- class: 0
|   |   |   |   |--- concavity_mean >  0.05
|   |   |   |   |   |--- class: 1

In [78]:
# Gini importance
print(pd.DataFrame(estimator.feature_importances_, columns=["Imp"],
                   index=X_train.columns).sort_values(by='Imp', ascending=False))
                         Imp
concave points_worst    0.74
radius_mean             0.06
concavity_worst         0.05
texture_worst           0.03
compactness_mean        0.03
fractal_dimension_se    0.02
fractal_dimension_worst 0.02
concave points_se       0.02
compactness_se          0.01
smoothness_mean         0.01
concavity_mean          0.01
symmetry_mean           0.00
symmetry_worst          0.00
compactness_worst       0.00
smoothness_worst        0.00
concave points_mean     0.00
fractal_dimension_mean  0.00
texture_se              0.00
concavity_se            0.00
texture_mean            0.00
smoothness_se           0.00
area_se                 0.00
perimeter_se            0.00
symmetry_se             0.00
In [79]:
importances = estimator.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(20,15))
plt.title('Feature Importance')
plt.barh(range(len(indices)), importances[indices], color='green', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()

Decision Tree¶

In [80]:
dtree_estimator = DecisionTreeClassifier(class_weight={0: 0.3, 1: 0.7}, random_state=1)
dtree_estimator.fit(X_train, y_train)
Out[80]:
DecisionTreeClassifier(class_weight={0: 0.3, 1: 0.7}, random_state=1)
In [81]:
#  Function to calculate different metric scores - Accuracy, Recall, Precision, and F1 Score

def get_metrics_score(model,flag=True):
    # defining an empty list to store train and test results
    score_list=[] 
    
    pred_train = model.predict(X_train)
    pred_test = model.predict(X_test)
    
    train_acc = model.score(X_train,y_train)
    test_acc = model.score(X_test,y_test)
    
    train_recall = metrics.recall_score(y_train,pred_train)
    test_recall = metrics.recall_score(y_test,pred_test)
    # Recall = minimizes false negatives
    
    train_precision = metrics.precision_score(y_train,pred_train)
    test_precision = metrics.precision_score(y_test,pred_test)
    # Precision = minimizes false positives

    train_f1 = metrics.f1_score(y_train,pred_train)
    test_f1 = metrics.f1_score(y_test,pred_test)
    # F1 Score = balances precision and recall
    
    score_list.extend((train_acc,test_acc,train_recall,test_recall,train_precision,test_precision,train_f1,test_f1))
        
    if flag:
        print("Accuracy on training set : ", train_acc)
        print("Accuracy on test set : ", test_acc)
        print("Recall on training set : ", train_recall)
        print("Recall on test set : ", test_recall)
        print("Precision on training set : ", train_precision)
        print("Precision on test set : ", test_precision)
        print("F1 Score on training set : ", train_f1)
        print("F1 Score on test set : ", test_f1)
    
    return score_list # returns the list with train and test scores
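The comments above note what each metric optimizes for. As a minimal illustration on hypothetical toy labels (not this dataset), the same four scores can be computed by hand from the confusion counts:

```python
# Hand-computed versions of the metrics printed by get_metrics_score,
# using hypothetical toy labels (1 = malignant, 0 = benign)
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # false negatives
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)  # true negatives

accuracy = (tp + tn) / len(y_true)
recall = tp / (tp + fn)           # fraction of actual positives caught (penalizes FN)
precision = tp / (tp + fp)        # fraction of predicted positives correct (penalizes FP)
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(accuracy, recall, precision, f1)
```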
In [82]:
# Function to make confusion matrix

def make_confusion_matrix(model, y_actual, labels=[0, 1]):
    # Predict using the model (X_test comes from the enclosing scope,
    # so y_actual should be y_test)
    y_predict = model.predict(X_test)
    cm = metrics.confusion_matrix(y_actual, y_predict, labels=labels)
    
    # Create a labeled DataFrame for the confusion matrix
    df_cm = pd.DataFrame(cm, index = [i for i in ["Actual - No","Actual - Yes"]],
                              columns = [i for i in ['Predicted - No','Predicted - Yes']])
    
    # Format counts and percentages for annotation
    group_counts = ["{0:0.0f}".format(value) for value in cm.flatten()]
    group_percentages = ["{0:.2%}".format(value) for value in cm.flatten()/np.sum(cm)]
    
    # Combine into annotation labels
    labels = [f"{v1}\n{v2}" for v1, v2 in zip(group_counts, group_percentages)]
    labels = np.asarray(labels).reshape(2,2)
    
    # Plot the heatmap
    plt.figure(figsize = (10,7))
    sns.heatmap(df_cm, annot=labels, fmt='', cmap='Blues')
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.show()
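For reference, the `labels=[0, 1]` ordering passed to `confusion_matrix` fixes which cell holds which count: rows are actual classes, columns are predicted classes. A hand-rolled sketch on hypothetical labels:

```python
# Minimal hand-rolled confusion matrix matching the labels=[0, 1] ordering:
# row = actual class, column = predicted class
def confusion_counts(y_actual, y_predict, labels=(0, 1)):
    cm = [[0 for _ in labels] for _ in labels]
    index = {label: i for i, label in enumerate(labels)}
    for t, p in zip(y_actual, y_predict):
        cm[index[t]][index[p]] += 1
    return cm

# With labels=(0, 1): cm[0][0] = true negatives, cm[0][1] = false positives,
# cm[1][0] = false negatives, cm[1][1] = true positives
print(confusion_counts([0, 0, 1, 1, 1], [0, 1, 1, 1, 0]))
```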
In [83]:
# Calculate metrics for your model
get_metrics_score(dtree_estimator)

# Create confusion matrix for your model
make_confusion_matrix(dtree_estimator, y_test)
Accuracy on training set :  1.0
Accuracy on test set :  0.9239766081871345
Recall on training set :  1.0
Recall on test set :  0.8412698412698413
Precision on training set :  1.0
Precision on test set :  0.9464285714285714
F1 Score on training set :  1.0
F1 Score on test set :  0.8907563025210085
In [84]:
dtree_tuned = DecisionTreeClassifier(class_weight={0:0.35, 1:0.65}, random_state=1)

parameters = {
    'max_depth': np.arange(2, 10),
    'min_samples_leaf': [5, 7, 10, 15],
    'max_leaf_nodes': [2, 3, 5, 10, 15],
    'min_impurity_decrease': [0.0001, 0.001, 0.01, 0.1]
}

# Parameters above control the size and shape of the tree, preventing it from growing too large or fitting noise.

scorer = metrics.make_scorer(metrics.recall_score)

grid_obj = GridSearchCV(dtree_tuned, parameters, scoring=scorer, cv=5, n_jobs=-1)
grid_obj.fit(X_train, y_train)

dtree_tuned = grid_obj.best_estimator_

dtree_tuned.fit(X_train, y_train)
Out[84]:
DecisionTreeClassifier(class_weight={0: 0.35, 1: 0.65}, max_depth=3,
                       max_leaf_nodes=5, min_impurity_decrease=0.01,
                       min_samples_leaf=7, random_state=1)
In [85]:
get_metrics_score(dtree_tuned)
make_confusion_matrix(dtree_tuned,y_test)
Accuracy on training set :  0.957286432160804
Accuracy on test set :  0.9181286549707602
Recall on training set :  0.959731543624161
Recall on test set :  0.8888888888888888
Precision on training set :  0.9285714285714286
Precision on test set :  0.8888888888888888
F1 Score on training set :  0.9438943894389439
F1 Score on test set :  0.8888888888888888

Bagging Classifier¶

In [86]:
# Fit the model
bagging_classifier = BaggingClassifier(random_state=1)
bagging_classifier.fit(X_train,y_train)

# Calculate metrics
get_metrics_score(bagging_classifier)

# Create the confusion matrix
make_confusion_matrix(bagging_classifier,y_test)
Accuracy on training set :  0.9974874371859297
Accuracy on test set :  0.9590643274853801
Recall on training set :  0.9932885906040269
Recall on test set :  0.9206349206349206
Precision on training set :  1.0
Precision on test set :  0.9666666666666667
F1 Score on training set :  0.9966329966329966
F1 Score on test set :  0.943089430894309
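Bagging's variance reduction comes from training each base tree on a bootstrap sample (drawn with replacement), so each tree sees roughly 63% of the distinct training rows. A quick sketch of that expected fraction (398 is approximately this notebook's training-split size; the limit is 1 - 1/e):

```python
import math

# Fraction of the training set expected to appear in one bootstrap sample of size n.
# Each row is missed with probability (1 - 1/n)^n, which tends to 1/e as n grows.
def expected_in_bag_fraction(n):
    return 1 - (1 - 1 / n) ** n

print(expected_in_bag_fraction(398))   # roughly the train-split size here
print(1 - 1 / math.e)                  # the limiting value, about 0.632
```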

Tuned Bagging Classifier¶

In [87]:
# Define base BaggingClassifier
bagging_tuned = BaggingClassifier(random_state=1)

# Hyperparameter grid
parameters = {
    'max_samples': [0.7, 0.8, 0.9, 1],
    'max_features': [0.7, 0.8, 0.9, 1],
    'n_estimators': [10, 20, 30, 40, 50],
}

# Use recall as scoring metric
recall_scorer = metrics.make_scorer(metrics.recall_score)

# Setup GridSearchCV with parallel jobs and 5-fold CV
grid_obj = GridSearchCV(bagging_tuned, parameters, scoring=recall_scorer, cv=5, n_jobs=-1)

# Fit grid search
grid_obj.fit(X_train, y_train)

# Best estimator from grid search
bagging_tuned = grid_obj.best_estimator_

# Fit the best model on full training data
bagging_tuned.fit(X_train, y_train)
Out[87]:
BaggingClassifier(max_features=0.7, max_samples=0.9, n_estimators=40,
                  random_state=1)
In [88]:
get_metrics_score(bagging_tuned)

make_confusion_matrix(bagging_tuned,y_test)
Accuracy on training set :  0.9974874371859297
Accuracy on test set :  0.9532163742690059
Recall on training set :  0.9932885906040269
Recall on test set :  0.9206349206349206
Precision on training set :  1.0
Precision on test set :  0.9508196721311475
F1 Score on training set :  0.9966329966329966
F1 Score on test set :  0.9354838709677419

Random Forest¶

In [89]:
# Fit the model
rf_estimator = RandomForestClassifier(random_state=1)
rf_estimator.fit(X_train,y_train)

# Calculate metrics
get_metrics_score(rf_estimator)

# Create the confusion matrix
make_confusion_matrix(rf_estimator,y_test)
Accuracy on training set :  1.0
Accuracy on test set :  0.9590643274853801
Recall on training set :  1.0
Recall on test set :  0.9206349206349206
Precision on training set :  1.0
Precision on test set :  0.9666666666666667
F1 Score on training set :  1.0
F1 Score on test set :  0.943089430894309
In [90]:
%%time

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn import metrics
import numpy as np

rf_tuned = RandomForestClassifier(class_weight={0:0.35,1:0.65}, random_state=1)

parameters = {  
    'max_depth': list(np.arange(3, 10, 1)),
    'max_features': [0.6, 0.7, 0.8, 0.9, 1.0],  # explicit list avoids float drift past 1.0 from np.arange
    'min_samples_split': np.arange(2, 20, 5),
    'n_estimators': np.arange(30, 160, 20),
    'min_impurity_decrease': [0.0001, 0.001, 0.01, 0.1]
}

scorer = metrics.make_scorer(metrics.recall_score)

grid_obj = GridSearchCV(rf_tuned, parameters, scoring=scorer, cv=5, n_jobs=-1)
grid_obj.fit(X_train, y_train)

rf_tuned = grid_obj.best_estimator_
rf_tuned.fit(X_train, y_train)
CPU times: user 7.78 s, sys: 991 ms, total: 8.77 s
Wall time: 3min 9s
Out[90]:
RandomForestClassifier(class_weight={0: 0.35, 1: 0.65}, max_depth=5,
                       max_features=0.6, min_impurity_decrease=0.01,
                       min_samples_split=12, n_estimators=90, random_state=1)
In [91]:
#Calculating different metrics
get_metrics_score(rf_tuned)

#Creating confusion matrix
make_confusion_matrix(rf_tuned,y_test)
Accuracy on training set :  0.9698492462311558
Accuracy on test set :  0.9415204678362573
Recall on training set :  0.9731543624161074
Recall on test set :  0.9206349206349206
Precision on training set :  0.9477124183006536
Precision on test set :  0.9206349206349206
F1 Score on training set :  0.9602649006622517
F1 Score on test set :  0.9206349206349206

Comparing Supervised ML Classification Models¶

In [92]:
# Identify the models to compare
models = [dtree_estimator, dtree_tuned, bagging_classifier, bagging_tuned, rf_estimator, rf_tuned]

# Define empty lists to store results
acc_train = []
acc_test = []
recall_train = []
recall_test = []
precision_train = []
precision_test = []
f1_train = []
f1_test = []

# Loop through all models to collect metrics (Accuracy, Recall, Precision, F1)
for model in models:
    scores = get_metrics_score(model, False)
    acc_train.append(scores[0])
    acc_test.append(scores[1])
    recall_train.append(scores[2])
    recall_test.append(scores[3])
    precision_train.append(scores[4])
    precision_test.append(scores[5])
    f1_train.append(scores[6])
    f1_test.append(scores[7])
In [93]:
# Compare models on evaluation metrics

comparison_frame = pd.DataFrame({
    'Model': ['Decision Tree', 'Tuned Decision Tree', 'Bagging Classifier', 'Tuned Bagging Classifier', 'Random Forest', 'Tuned Random Forest'],
    'Train_Accuracy': acc_train,
    'Test_Accuracy': acc_test,
    'Train_Recall': recall_train,
    'Test_Recall': recall_test,
    'Train_Precision': precision_train,
    'Test_Precision': precision_test,
    'Train_F1': f1_train,
    'Test_F1': f1_test
})

# Sort models in decreasing order of most important metric
comparison_frame_sorted = comparison_frame.sort_values(by='Test_Recall', ascending=False)

# Set display options to avoid wrapping wide DataFrames
pd.set_option('display.width', 200)
pd.set_option('display.max_columns', None)

# Print sorted DataFrame
print(comparison_frame_sorted)
                      Model  Train_Accuracy  Test_Accuracy  Train_Recall  Test_Recall  Train_Precision  Test_Precision  Train_F1  Test_F1
2        Bagging Classifier            1.00           0.96          0.99         0.92             1.00            0.97      1.00     0.94
3  Tuned Bagging Classifier            1.00           0.95          0.99         0.92             1.00            0.95      1.00     0.94
4             Random Forest            1.00           0.96          1.00         0.92             1.00            0.97      1.00     0.94
5       Tuned Random Forest            0.97           0.94          0.97         0.92             0.95            0.92      0.96     0.92
1       Tuned Decision Tree            0.96           0.92          0.96         0.89             0.93            0.89      0.94     0.89
0             Decision Tree            1.00           0.92          1.00         0.84             1.00            0.95      1.00     0.89
In [94]:
# Get feature names and their importance scores for the best model

model_name = 'Tuned Random Forest'  # dynamically assign this as needed

# Recompute importances for the tuned model (the earlier values were for the first decision tree)
importances = rf_tuned.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(20, 15))
plt.title(f'Feature Importance - {model_name}')
plt.barh(range(len(indices)), importances[indices], color='green', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')

for i, idx in enumerate(indices):
    val = importances[idx]
    if val < 0.001:
        label = "<0.001"
    else:
        label = f"{val:.3f}"
    plt.text(val + 0.001, i, label, va='center', fontsize=15)

plt.tight_layout()
plt.show()

Logistic Regression¶

In [95]:
# Import additional library
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, precision_score, recall_score, f1_score
In [96]:
# Standardize the features before running logistic regression
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Initialize and train the logistic regression model
log_reg = LogisticRegression(
    solver='newton-cg',
    max_iter=1000,
    penalty='l2',           # Regularization
    verbose=True,           # Shows optimization progress
    n_jobs=-1,              # Use all CPU cores
    random_state=1
)
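`StandardScaler` rescales each feature to zero mean and unit variance, which matters for scale-sensitive solvers such as `newton-cg`. A hand-computed sketch of the same transform on a hypothetical column:

```python
# Standardization by hand: what StandardScaler's fit_transform does to one feature,
# shown on a hypothetical toy column
column = [2.0, 4.0, 6.0, 8.0]
mean = sum(column) / len(column)
# StandardScaler uses the population standard deviation (ddof=0)
std = (sum((x - mean) ** 2 for x in column) / len(column)) ** 0.5
scaled = [(x - mean) / std for x in column]
print(mean, std, scaled)
```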
In [97]:
def get_metrics_score(model, X_test, y_test):
    y_pred = model.predict(X_test)
    print("Accuracy:", accuracy_score(y_test, y_pred))
    print("Precision:", precision_score(y_test, y_pred))
    print("Recall:", recall_score(y_test, y_pred))
    print("F1 Score:", f1_score(y_test, y_pred))
In [98]:
# Fit the model to the training data
log_reg.fit(X_train_scaled, y_train)

# Predict probabilities for ROC curve
y_probs = log_reg.predict_proba(X_test_scaled)[:, 1]
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
In [99]:
def get_metrics_score(model, X_train, y_train, X_test, y_test):
    # Predict on train and test
    y_pred_train = model.predict(X_train)
    y_pred_test = model.predict(X_test)

    # Print results
    print("Model Performance")
    print("-" * 40)
    print("{:<15} {:<10} {:<10}".format("Metric", "Train", "Test"))
    print("-" * 40)
    print("{:<15} {:<10.2f} {:<10.2f}".format("Accuracy", 
          accuracy_score(y_train, y_pred_train), accuracy_score(y_test, y_pred_test)))
    print("{:<15} {:<10.2f} {:<10.2f}".format("Precision", 
          precision_score(y_train, y_pred_train), precision_score(y_test, y_pred_test)))
    print("{:<15} {:<10.2f} {:<10.2f}".format("Recall", 
          recall_score(y_train, y_pred_train), recall_score(y_test, y_pred_test)))
    print("{:<15} {:<10.2f} {:<10.2f}".format("F1 Score", 
          f1_score(y_train, y_pred_train), f1_score(y_test, y_pred_test)))
    print("-" * 40)
In [100]:
def make_confusion_matrix(model, X_test, y_actual, labels=[0, 1]):
    # Predict using the model
    y_predict = model.predict(X_test)
    
    # Compute confusion matrix
    cm = metrics.confusion_matrix(y_actual, y_predict, labels=labels)
    
    # Create labeled DataFrame
    df_cm = pd.DataFrame(
        cm,
        index=["Actual - No", "Actual - Yes"],
        columns=["Predicted - No", "Predicted - Yes"]
    )
    
    # Prepare annotation labels (counts + percentages)
    group_counts = ["{0:0.0f}".format(value) for value in cm.flatten()]
    group_percentages = ["{0:.2%}".format(value) for value in cm.flatten() / np.sum(cm)]
    annot_labels = [f"{count}\n{percent}" for count, percent in zip(group_counts, group_percentages)]
    annot_labels = np.asarray(annot_labels).reshape(cm.shape)
    
    # Plot the heatmap
    plt.figure(figsize=(10, 7))
    sns.heatmap(df_cm, annot=annot_labels, fmt="", cmap="Blues")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
    plt.title("")
    plt.tight_layout()
    plt.show()
In [101]:
get_metrics_score(log_reg, X_train_scaled, y_train, X_test_scaled, y_test)
make_confusion_matrix(log_reg, X_test_scaled, y_test)
Model Performance
----------------------------------------
Metric          Train      Test      
----------------------------------------
Accuracy        0.99       0.95      
Precision       1.00       0.94      
Recall          0.98       0.94      
F1 Score        0.99       0.94      
----------------------------------------

Create ROC Curve¶

In [102]:
# Import additional library
from sklearn.metrics import roc_curve, auc

# Ensure X_test is scaled like training data
y_probs = log_reg.predict_proba(X_test_scaled)[:, 1]

# Compute ROC curve and AUC
fpr, tpr, thresholds = roc_curve(y_test, y_probs)
roc_auc = auc(fpr, tpr)

# Plot
plt.figure(figsize=(10, 6))
plt.plot(fpr, tpr, color='blue', lw=2, label=f'ROC Curve (AUC = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], color='red', lw=1, linestyle='--', label='Chance')

plt.xlabel('False Positive Rate', fontsize=12)
plt.ylabel('True Positive Rate', fontsize=12)
plt.title('Receiver Operating Characteristic (ROC) Curve', fontsize=14)
plt.tick_params(axis='both', labelsize=12)
plt.legend(loc='lower right', fontsize=14)
plt.grid(True)
plt.tight_layout()
plt.show()
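Each point on the ROC curve is the (FPR, TPR) pair produced by one probability threshold. A sketch of a single point, on hypothetical predicted probabilities:

```python
# One ROC point computed by hand: TPR and FPR at a single probability threshold,
# using hypothetical predicted probabilities
y_true = [0, 0, 1, 1, 0, 1]
y_prob = [0.2, 0.6, 0.7, 0.9, 0.4, 0.3]
threshold = 0.5
y_hat = [1 if p >= threshold else 0 for p in y_prob]

tp = sum(1 for t, h in zip(y_true, y_hat) if t == 1 and h == 1)
fn = sum(1 for t, h in zip(y_true, y_hat) if t == 1 and h == 0)
fp = sum(1 for t, h in zip(y_true, y_hat) if t == 0 and h == 1)
tn = sum(1 for t, h in zip(y_true, y_hat) if t == 0 and h == 0)

tpr = tp / (tp + fn)  # true positive rate (recall), y-axis of the ROC curve
fpr = fp / (fp + tn)  # false positive rate, x-axis of the ROC curve
print(tpr, fpr)
```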

Sigmoid Curve¶

In [103]:
from scipy.special import expit  # Numerically stable sigmoid

# Choose the feature you want to visualize
feature_name = "concave points_worst"  # <-- CHANGE this to your real feature name
if feature_name not in X_train.columns:
    raise ValueError(f"'{feature_name}' is not a valid feature in X_train.")

# Get the index of the feature
feature_index = list(X_train.columns).index(feature_name)

# Get the corresponding coefficient and intercept from the trained model
coef = log_reg.coef_[0][feature_index]
intercept = log_reg.intercept_[0]

# Generate a range of values across that feature's actual (unscaled) range
x_vals = np.linspace(X_train[feature_name].min(), X_train[feature_name].max(), 300)

# Re-standardize manually to match the model's input scale
mean = X_train[feature_name].mean()
std = X_train[feature_name].std()
x_vals_scaled = (x_vals - mean) / std

# Compute z = w*x + b and apply sigmoid
z = intercept + coef * x_vals_scaled
sigmoid_vals = expit(z)

# Plot the sigmoid function
plt.figure(figsize=(10, 6))
plt.plot(x_vals, sigmoid_vals, label='Sigmoid Curve', color='blue')
plt.axhline(0.5, color='red', linestyle='--', label='Threshold = 0.5')
plt.xlabel(f'{feature_name}', fontsize=15)
plt.ylabel('Predicted Probability (Class 1)', fontsize=15)
plt.title(f'Sigmoid Function for Feature: {feature_name}')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()

# Compute the feature value at which probability = 0.5 (decision boundary)
decision_boundary_scaled = -intercept / coef
decision_boundary = decision_boundary_scaled * std + mean  # convert back to original scale

# Compute curve steepness (magnitude of slope)
steepness = abs(coef)

# Generate interpretation message
print("\nInterpretation:")
print(f"- This sigmoid curve shows how the feature '{feature_name}' influences the model's prediction for class 1.")
print(f"- The decision threshold (where predicted probability = 0.5) occurs at approximately **{decision_boundary:.2f}**.")
print(f"- Feature values **below {decision_boundary:.2f}** are associated with a low predicted probability of class 1.")
print(f"- Feature values **above {decision_boundary:.2f}** are associated with a high predicted probability of class 1.")

# Interpret steepness
if steepness > 5:
    print(f"- The curve is steep, meaning '{feature_name}' is a **strong predictor** in the model.")
elif steepness > 1:
    print(f"- The curve is moderately steep, so '{feature_name}' is a **meaningful but not dominant predictor**.")
else:
    print(f"- The curve is relatively flat, so '{feature_name}' may have **limited predictive power** on its own.")
Interpretation:
- This sigmoid curve shows how the feature 'concave points_worst' influences the model's prediction for class 1.
- The decision threshold (where predicted probability = 0.5) occurs at approximately **0.17**.
- Feature values **below 0.17** are associated with a low predicted probability of class 1.
- Feature values **above 0.17** are associated with a high predicted probability of class 1.
- The curve is relatively flat, so 'concave points_worst' may have **limited predictive power** on its own.
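The decision-boundary step above rests on a small piece of algebra: the sigmoid crosses 0.5 exactly where its argument is zero, so in scaled units the boundary is x = -intercept / coef. A minimal check with hypothetical values of w and b (not the fitted log_reg values):

```python
import math

# expit computed by hand, plus the decision-boundary algebra used above:
# sigmoid(b + w*x) = 0.5 exactly when b + w*x = 0, i.e. x = -b / w (scaled units)
def sigmoid(z):
    return 1 / (1 + math.exp(-z))

w, b = 2.0, -1.0      # hypothetical coefficient and intercept
boundary = -b / w     # scaled feature value where the probability crosses 0.5

print(sigmoid(0.0))              # 0.5 at z = 0
print(sigmoid(b + w * boundary)) # exactly 0.5 at the boundary
```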

Support Vector Machines (SVM)¶

In [104]:
# Import additional required library
from sklearn.svm import SVC

# Build function to evaluate model on scaled data
def evaluate_model(name, model, X_train_scaled, y_train, X_test_scaled, y_test): 
    print(f"\n Performance: {name}") 
    print("-" * 50)

    # Predict on train and test sets
    y_pred_train = model.predict(X_train_scaled)
    y_pred_test = model.predict(X_test_scaled)

    # Compute metrics
    train_metrics = {
        'Accuracy': accuracy_score(y_train, y_pred_train),
        'Precision': precision_score(y_train, y_pred_train),
        'Recall': recall_score(y_train, y_pred_train),
        'F1 Score': f1_score(y_train, y_pred_train)
    }

    test_metrics = {
        'Accuracy': accuracy_score(y_test, y_pred_test),
        'Precision': precision_score(y_test, y_pred_test),
        'Recall': recall_score(y_test, y_pred_test),
        'F1 Score': f1_score(y_test, y_pred_test)
    }

    # Print table-style output
    print("{:<12} {:<10} {:<10}".format("Metric", "Train", "Test"))
    print("-" * 32)
    for metric in train_metrics:
        print("{:<12} {:<10.2f} {:<10.2f}".format(
            metric, train_metrics[metric], test_metrics[metric]
        ))

    # Confusion Matrix (Test Set Only)
    cm = confusion_matrix(y_test, y_pred_test)
    total = cm.sum()
    labels = [f"{v}\n{v/total:.2%}" for v in cm.flatten()]
    labels = np.array(labels).reshape(cm.shape)

    df_cm = pd.DataFrame(cm,
                         index=["Actual - No", "Actual - Yes"],
                         columns=["Predicted - No", "Predicted - Yes"])

    plt.figure(figsize=(10, 6))
    sns.heatmap(df_cm, annot=labels, fmt='', cmap='Blues', cbar=False)
    plt.title("")
    plt.xlabel("Predicted label")
    plt.ylabel("True label")
    plt.tight_layout()
    plt.show()

# Create the SVM model with a linear kernel
svm_model = SVC(kernel='linear', probability=True, random_state=0)

# Train the model
svm_model.fit(X_train_scaled, y_train)

evaluate_model("Support Vector Machine", svm_model, X_train_scaled, y_train, X_test_scaled, y_test)
 Performance: Support Vector Machine
--------------------------------------------------
Metric       Train      Test      
--------------------------------
Accuracy     0.99       0.96      
Precision    1.00       0.97      
Recall       0.97       0.92      
F1 Score     0.99       0.94      
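With a linear kernel, the SVM's prediction reduces to the sign of w·x + b. A sketch with hypothetical two-feature weights (not the fitted svm_model values):

```python
# A linear SVM classifies by the sign of the decision function w.x + b.
# Hypothetical 2-feature weights for illustration only.
w = [1.5, -0.5]
b = -1.0

def linear_svm_predict(x):
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else 0

print(linear_svm_predict([2.0, 1.0]))  # 1.5*2 - 0.5*1 - 1 = 1.5  -> class 1
print(linear_svm_predict([0.5, 1.0]))  # 1.5*0.5 - 0.5*1 - 1 = -0.75 -> class 0
```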
In [105]:
# Create SVM confusion matrix

def make_svm_confusion_matrix(model, X_test, y_test, labels=[0, 1]):
    # Predict using the model
    y_pred = model.predict(X_test)
    
    # Compute confusion matrix
    cm = confusion_matrix(y_test, y_pred, labels=labels)
    
    # Create labeled DataFrame
    df_cm = pd.DataFrame(
        cm,
        index=["Actual - No", "Actual - Yes"],
        columns=["Predicted - No", "Predicted - Yes"]
    )
    
    # Prepare annotation labels (counts + percentages)
    group_counts = ["{0:0.0f}".format(value) for value in cm.flatten()]
    group_percentages = ["{0:.2%}".format(value) for value in cm.flatten() / np.sum(cm)]
    annot_labels = [f"{count}\n{percent}" for count, percent in zip(group_counts, group_percentages)]
    annot_labels = np.asarray(annot_labels).reshape(cm.shape)
    
    # Plot the heatmap
    plt.figure(figsize=(10, 7))
    sns.heatmap(df_cm, annot=annot_labels, fmt="", cmap="Blues")
    plt.title("")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
    plt.tight_layout()
    plt.show()

make_svm_confusion_matrix(svm_model, X_test, y_test)

Boosting Algorithms¶

In [106]:
def evaluate_model(name, model, X_train, y_train, X_test, y_test):
    print(f"\n Performance: {name}")
    print("-" * 50)

    # Predict on train and test sets
    y_pred_train = model.predict(X_train)
    y_pred_test = model.predict(X_test)

    # Compute metrics
    train_metrics = {
        'Accuracy': accuracy_score(y_train, y_pred_train),
        'Precision': precision_score(y_train, y_pred_train),
        'Recall': recall_score(y_train, y_pred_train),
        'F1 Score': f1_score(y_train, y_pred_train)
    }

    test_metrics = {
        'Accuracy': accuracy_score(y_test, y_pred_test),
        'Precision': precision_score(y_test, y_pred_test),
        'Recall': recall_score(y_test, y_pred_test),
        'F1 Score': f1_score(y_test, y_pred_test)
    }

    # Print table-style output
    print("{:<12} {:<10} {:<10}".format("Metric", "Train", "Test"))
    print("-" * 32)
    for metric in train_metrics:
        print("{:<12} {:<10.2f} {:<10.2f}".format(
            metric, train_metrics[metric], test_metrics[metric]
        ))

    # Confusion Matrix (Test Set Only)
    cm = confusion_matrix(y_test, y_pred_test)
    total = cm.sum()
    labels = [f"{v}\n{v/total:.2%}" for v in cm.flatten()]
    labels = np.array(labels).reshape(cm.shape)

    df_cm = pd.DataFrame(cm,
                         index=["Actual - No", "Actual - Yes"],
                         columns=["Predicted - No", "Predicted - Yes"])

    plt.figure(figsize=(10, 6))
    sns.heatmap(df_cm, annot=labels, fmt='', cmap='Blues', cbar=False)
    plt.title("")
    plt.xlabel("Predicted label")
    plt.ylabel("True label")
    plt.tight_layout()
    plt.show()

AdaBoost (Adaptive Boosting)¶

In [107]:
from sklearn.ensemble import AdaBoostClassifier

adaboost = AdaBoostClassifier(random_state=0)
adaboost.fit(X_train, y_train)

evaluate_model("AdaBoost", adaboost, X_train, y_train, X_test, y_test)
 Performance: AdaBoost
--------------------------------------------------
Metric       Train      Test      
--------------------------------
Accuracy     1.00       0.97      
Precision    1.00       0.97      
Recall       1.00       0.95      
F1 Score     1.00       0.96      
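AdaBoost combines its weak learners with a vote weight of alpha = 0.5·ln((1 - err) / err), so more accurate learners get a stronger say and a coin-flip learner gets none. A minimal sketch of that formula:

```python
import math

# AdaBoost's weight for each weak learner: alpha = 0.5 * ln((1 - err) / err).
# Low-error learners get large positive weight; err = 0.5 gives zero weight.
def adaboost_alpha(err):
    return 0.5 * math.log((1 - err) / err)

print(adaboost_alpha(0.1))  # accurate learner -> strong vote
print(adaboost_alpha(0.5))  # coin-flip learner -> zero vote
```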

Tuned AdaBoost¶

In [108]:
tuned_adaboost = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=2),
    n_estimators=100,
    learning_rate=0.8,
    algorithm='SAMME.R',
    random_state=0
)

tuned_adaboost.fit(X_train, y_train)
evaluate_model("Tuned AdaBoost", tuned_adaboost, X_train, y_train, X_test, y_test)
 Performance: Tuned AdaBoost
--------------------------------------------------
Metric       Train      Test      
--------------------------------
Accuracy     1.00       0.96      
Precision    1.00       0.97      
Recall       1.00       0.94      
F1 Score     1.00       0.95      

Gradient Boosting¶

In [109]:
from sklearn.ensemble import GradientBoostingClassifier

gb = GradientBoostingClassifier(random_state=0)
gb.fit(X_train, y_train)
evaluate_model("Gradient Boosting", gb, X_train, y_train, X_test, y_test)
 Performance: Gradient Boosting
--------------------------------------------------
Metric       Train      Test      
--------------------------------
Accuracy     1.00       0.96      
Precision    1.00       0.97      
Recall       1.00       0.92      
F1 Score     1.00       0.94      
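Gradient boosting for squared error fits each new learner to the residuals of the current ensemble, scaled by the learning rate. A stripped-down sketch with an idealized weak learner that predicts the residuals exactly:

```python
# One hundred gradient-boosting stages for squared error, by hand: each stage
# fits the residuals of the current prediction, and learning_rate shrinks its update.
y = [3.0, 5.0, 8.0]
pred = [sum(y) / len(y)] * len(y)   # stage 0: predict the mean
learning_rate = 0.05

for _ in range(100):
    residuals = [t - p for t, p in zip(y, pred)]
    # an idealized weak learner predicts the residuals exactly; shrink its update
    pred = [p + learning_rate * r for p, r in zip(pred, residuals)]

print(pred)  # approaches y as stages accumulate
```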

Tuned Gradient Boosting¶

In [110]:
tuned_gb = GradientBoostingClassifier(
    n_estimators=200,
    learning_rate=0.05,
    max_depth=3,
    subsample=0.8,
    max_features='sqrt',
    random_state=0
)
tuned_gb.fit(X_train, y_train)
evaluate_model("Tuned Gradient Boosting", tuned_gb, X_train, y_train, X_test, y_test)
 Performance: Tuned Gradient Boosting
--------------------------------------------------
Metric       Train      Test      
--------------------------------
Accuracy     1.00       0.96      
Precision    1.00       0.95      
Recall       1.00       0.94      
F1 Score     1.00       0.94      
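Instead of fixing `n_estimators` in advance, gradient boosting can stop adding trees once a held-out score stops improving. A minimal sketch of scikit-learn's built-in early stopping (`n_iter_no_change` with `validation_fraction`), again on the library's bundled WDBC copy rather than this notebook's split:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

gb_es = GradientBoostingClassifier(
    n_estimators=500,          # generous upper bound, not the number actually used
    learning_rate=0.05,
    validation_fraction=0.1,   # internal hold-out used to monitor the score
    n_iter_no_change=10,       # stop after 10 rounds without improvement
    random_state=0,
)
gb_es.fit(X_tr, y_tr)
print("trees used:", gb_es.n_estimators_,
      "test accuracy:", round(gb_es.score(X_te, y_te), 3))
```

`n_estimators_` reports how many trees were actually fit before stopping, which removes one hyperparameter from the manual tuning above.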

XGBoost¶

In [111]:
!pip install xgboost

from xgboost import XGBClassifier
Requirement already satisfied: xgboost in /opt/anaconda3/lib/python3.12/site-packages (3.0.2)
Requirement already satisfied: numpy in /opt/anaconda3/lib/python3.12/site-packages (from xgboost) (1.26.4)
Requirement already satisfied: scipy in /opt/anaconda3/lib/python3.12/site-packages (from xgboost) (1.13.1)
In [112]:
xgb = XGBClassifier(
    use_label_encoder=False,  # only needed on XGBoost 1.x; ignored (with a warning) in 2.0+
    eval_metric='logloss',
    random_state=0
)
xgb.fit(X_train, y_train)
evaluate_model("XGBoost (Default)", xgb, X_train, y_train, X_test, y_test)
 Performance: XGBoost (Default)
--------------------------------------------------
Metric       Train      Test      
--------------------------------
Accuracy     1.00       0.97      
Precision    1.00       0.98      
Recall       1.00       0.94      
F1 Score     1.00       0.96      

Tuned XGBoost¶

In [113]:
tuned_xgb = XGBClassifier(
    n_estimators=300,
    learning_rate=0.05,
    max_depth=4,
    subsample=0.8,            # row subsampling per tree
    colsample_bytree=0.8,     # feature subsampling per tree
    gamma=0.1,                # minimum loss reduction required to split
    reg_alpha=0.5,            # L1 regularization
    reg_lambda=1,             # L2 regularization
    use_label_encoder=False,  # only needed on XGBoost 1.x; ignored (with a warning) in 2.0+
    eval_metric='logloss',
    random_state=0
)
tuned_xgb.fit(X_train, y_train)
evaluate_model("XGBoost (Tuned)", tuned_xgb, X_train, y_train, X_test, y_test)
 Performance: XGBoost (Tuned)
--------------------------------------------------
Metric       Train      Test      
--------------------------------
Accuracy     1.00       0.98      
Precision    1.00       0.98      
Recall       0.99       0.95      
F1 Score     1.00       0.97      
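All the scores above come from a single train/test split, so differences of a point or two may be noise. A minimal cross-validation sketch of the same idea, using scikit-learn's bundled WDBC copy and the default gradient boosting model as a stand-in for any of the classifiers above:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = load_breast_cancer(return_X_y=True)
y = 1 - y  # relabel so malignant = 1, matching this notebook's positive class

# 5-fold stratified CV: every sample is scored exactly once as test data
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(GradientBoostingClassifier(random_state=0), X, y,
                         cv=cv, scoring="recall")
print("recall per fold:", scores.round(2), "mean:", round(scores.mean(), 3))
```

The spread across folds gives a rough error bar for the single-split recall values reported above.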

Create expanded comparison table¶

In [114]:
# Identify all models to compare
models = [
    dtree_estimator,
    dtree_tuned,
    bagging_classifier,
    bagging_tuned,
    rf_estimator,
    rf_tuned,
    log_reg,
    svm_model,
    adaboost,
    tuned_adaboost,
    gb,
    tuned_gb,
    xgb,
    tuned_xgb
]
In [115]:
# List of model names
model_names = [
    'Decision Tree',
    'Tuned Decision Tree',
    'Bagging Classifier',
    'Tuned Bagging Classifier',
    'Random Forest',
    'Tuned Random Forest',
    'Logistic Regression',
    'Support Vector Machine',
    'AdaBoost',
    'Tuned AdaBoost',
    'Gradient Boosting',
    'Tuned Gradient Boosting',
    'XGBoost',
    'Tuned XGBoost'
]
In [116]:
# Clear the lists
acc_train, acc_test = [], []
recall_train, recall_test = [], []
precision_train, precision_test = [], []
f1_train, f1_test = [], []
final_model_names = []
In [117]:
def get_metrics_score(model, X_train, y_train, X_test, y_test):
    y_pred_train = model.predict(X_train)
    y_pred_test = model.predict(X_test)

    acc_train = accuracy_score(y_train, y_pred_train)
    acc_test = accuracy_score(y_test, y_pred_test)

    recall_train = recall_score(y_train, y_pred_train)
    recall_test = recall_score(y_test, y_pred_test)

    precision_train = precision_score(y_train, y_pred_train)
    precision_test = precision_score(y_test, y_pred_test)

    f1_train = f1_score(y_train, y_pred_train)
    f1_test = f1_score(y_test, y_pred_test)

    return [
        acc_train, acc_test,
        recall_train, recall_test,
        precision_train, precision_test,
        f1_train, f1_test
    ]
In [118]:
model_data_map = {
    "Logistic Regression": (X_train_scaled, X_test_scaled),
    "Support Vector Machine": (X_train_scaled, X_test_scaled),
    # others use unscaled
}
In [119]:
# Re-initialize so re-running this cell doesn't append duplicate names
final_model_names = []

for model, name in zip(models, model_names):
    if model is None:
        continue

    # Use scaled data if defined for that model
    X_tr, X_te = model_data_map.get(name, (X_train, X_test))

    scores = get_metrics_score(model, X_tr, y_train, X_te, y_test)

    acc_train.append(scores[0])
    acc_test.append(scores[1])
    recall_train.append(scores[2])
    recall_test.append(scores[3])
    precision_train.append(scores[4])
    precision_test.append(scores[5])
    f1_train.append(scores[6])
    f1_test.append(scores[7])
    final_model_names.append(name)
In [120]:
print("Lengths:", len(final_model_names), len(acc_train), len(acc_test), len(f1_test))
Lengths: 14 14 14 14
In [121]:
comparison_frame = pd.DataFrame({
    'Model': final_model_names,
    'Train_Accuracy': acc_train,
    'Test_Accuracy': acc_test,
    'Train_Recall': recall_train,
    'Test_Recall': recall_test,
    'Train_Precision': precision_train,
    'Test_Precision': precision_test,
    'Train_F1': f1_train,
    'Test_F1': f1_test
})

comparison_frame_sorted = comparison_frame.sort_values(by='Test_Recall', ascending=False)  # Sort by recall: missing a malignant case is the costliest error

# Display the table
pd.set_option('display.width', 200)
pd.set_option('display.max_columns', None)
comparison_frame_sorted
Out[121]:
    Model                     Train_Accuracy  Test_Accuracy  Train_Recall  Test_Recall  Train_Precision  Test_Precision  Train_F1  Test_F1
8   AdaBoost                            1.00           0.97          1.00         0.95             1.00            0.97      1.00     0.96
13  Tuned XGBoost                       1.00           0.98          0.99         0.95             1.00            0.98      1.00     0.97
6   Logistic Regression                 0.99           0.95          0.98         0.94             1.00            0.94      0.99     0.94
9   Tuned AdaBoost                      1.00           0.96          1.00         0.94             1.00            0.97      1.00     0.95
11  Tuned Gradient Boosting             1.00           0.96          1.00         0.94             1.00            0.95      1.00     0.94
12  XGBoost                             1.00           0.97          1.00         0.94             1.00            0.98      1.00     0.96
2   Bagging Classifier                  1.00           0.96          0.99         0.92             1.00            0.97      1.00     0.94
3   Tuned Bagging Classifier            1.00           0.95          0.99         0.92             1.00            0.95      1.00     0.94
4   Random Forest                       1.00           0.96          1.00         0.92             1.00            0.97      1.00     0.94
5   Tuned Random Forest                 0.97           0.94          0.97         0.92             0.95            0.92      0.96     0.92
7   Support Vector Machine              0.99           0.96          0.97         0.92             1.00            0.97      0.99     0.94
10  Gradient Boosting                   1.00           0.96          1.00         0.92             1.00            0.97      1.00     0.94
1   Tuned Decision Tree                 0.96           0.92          0.96         0.89             0.93            0.89      0.94     0.89
0   Decision Tree                       1.00           0.92          1.00         0.84             1.00            0.95      1.00     0.89
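Sorting by Test_Recall alone leaves a tie at the top (AdaBoost and Tuned XGBoost both at 0.95); adding Test_F1 as a secondary sort key breaks it. A minimal sketch using values copied from the top of the table above:

```python
import pandas as pd

# Top three rows of the comparison table, by Test_Recall
df = pd.DataFrame({
    "Model": ["AdaBoost", "Tuned XGBoost", "Logistic Regression"],
    "Test_Recall": [0.95, 0.95, 0.94],
    "Test_F1": [0.96, 0.97, 0.94],
})

# Sort on recall first, then F1 to break ties
best = df.sort_values(["Test_Recall", "Test_F1"], ascending=False).iloc[0]
print(best["Model"])  # Tuned XGBoost wins the tie on F1
```

The same two-key sort could replace the single-key `sort_values` call above if a unique "best model" row is wanted.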